|
|
|---|
|
This work has been supported by the National Science Foundation
under an Educational Infrastructure grant, CDA-9017953 and
developed as part of the High Performance Scientific Computing
(HPSC) project at the University of Colorado at Boulder.
Watch this space!
A vector computer or vector processor is a machine designed to efficiently handle arithmetic operations on elements of arrays, called vectors. Such machines are especially useful in high-performance scientific computing, where matrix and vector arithmetic are quite common. The Cray Y-MP and the Convex C3880 are two examples of vector processors used today.
This tutorial provides a general overview of the architecture of a vector computer. This includes an introduction to vectors and vector arithmetic, a discussion of performance measurements used to evaluate this type of machine, and a comparison of the characteristics of particular vector computers. A brief history of vector processors is provided as well, with a focus on the Cray vector architectures.
|
Vectors and vector arithmetic
To understand the concepts behind a vector processor, we first present a short review of vectors and vector arithmetic. |
|
Vector computing
architectural concepts
We continue by showing the application of these ideas to the hardware in vector processors. |
|
Vector computing performance
We then discuss performance and performance metrics, providing these figures for a few specific vector processors. |
|
The evolution of vector
computers
Finally we relate the history of some vector processors. In particular, we focus on Cray vector processors. |
DIMENSION V(N)
where N is an integer variable holding the value of the
length of the vector. Throughout this tutorial,
we use the following terms almost interchangeably: vector, array, list.
Arithmetic operations may be performed on vectors. Two vectors are added by adding corresponding elements:
DO I=1,N
S(I) = X(I) + Y(I)
ENDDO
where s is the vector representing the final sum and S,
X, and Y have been declared as arrays of
dimension N. This operation is sometimes called
elementwise addition.
Similarly, the subtraction of two vectors, x - y,
is an elementwise operation.
Vector computing architectural concepts
A vector computer contains a set of special arithmetic units called
pipelines. These
pipelines overlap the execution of the different parts
of an arithmetic operation on the elements of the vector,
producing a more efficient execution of the arithmetic operation.
In many respects, a pipeline is similar to an assembly
line in a factory where different steps of the assembly
of an automobile, for example, are performed at different stages
of the line.
In this section, we discuss how a vector pipeline operates, the advantages of this type of architecture, and other architectural features found in vector processors.
Figure 1 shows the step-by-step example of such an addition. The numbers to be added are x = 1234.00 and y = -567.8. In deference to the human reader, these are represented in decimal notation with a mantissa of four digits.
Now consider this scalar addition performed on all the elements of a pair of vectors (arrays) of length n. Each of the six stages needs to be executed for every pair of elements. If each stage of the execution takes tau units of time, then each addition takes 6*tau units of time (not counting the time required to fetch and decode the instruction itself or to fetch the two operands). So the number of time units required to add all the elements of the two vectors in a serial fashion would be Ts = 6*n*tau. These execution stages are shown in figure 2 with respect to time.
Observe that it still takes 6*tau units of time to complete the sum of the first pair of elements, but that the sum of the next pair is ready in only tau more units of time. And this pattern continues for each succeeding pair. This means that the time, Tp, to do the pipelined addition of two vectors of length n is
Comparing the equations for Ts and Tp, it is clear that
In this discussion, we have assumed that the floating-point addition requires six stages and takes 6*tau units of time. There is nothing magic about this number 6; in fact, for some architectures, the number of stages in a floating-point addition may be more or less than six. Further, the individual stages may be quite different from the ones listed in section on pipelined addition. The operations at each stage of a pipeline for floating-point multiplication are slightly different than those for addition; a multiplication pipeline may even have a different number of stages than an addition pipeline. There may also be pipelines for integer operations. As shown in figure 8, pipelines to perform vector operations on the Cray-1 have from one to fourteen stages, depending on the type of operation performed by the pipeline.
Some vector architectures provide greater efficiency by allowing the output of one pipeline to be chained directly into another pipeline. This feature is called chaining and eliminates the need to store the result of the first pipeline before sending it into the second pipeline. Figure 5 demonstrates the use of chaining in the computation of a saxpy vector operation:
Chaining can double the number of floating-point operations that are
done in tau units of time. Once both the multiplication
and addition pipelines have been filled, one floating-point multiplication
and one floating-point addition (a total of two floating-point operations)
are completed every tau time units.
Because of the ability to overlap memory accesses and the possible reuse of vector processors, vector-register vector processors are usually more efficient than memory-memory vector processors. However as the length of the vectors in a computation increase, this difference in efficiency between the two types of architectures is diminished. In fact, the memory-memory vector processors may prove more efficient if the vectors are long enough. Nevertheless, experience has shown that shorter vectors are more commonly used.
One memory access (load or store) of a data value in a memory bank takes several clock cycles to complete. Each memory bank allows only one data value to be read or stored in a single memory access, but more than one memory bank may be accessed at the same time. When the elements of a vector stored in an interleaved memory are read into a vector register, the reads are staggered across the memory banks so that one vector element is read from a bank per clock cycle. If one memory access takes n clock cycles, then n elements of a vector may be fetched at a cost of one memory access; this is n times faster than the same number of memory accesses to a single bank.
For typical vector architectures, the value of tau
(the time to complete one pipeline stage) is equivalent
to one clock cycle of the machine
Let k represent the number of tau time units the same sequential operation would take (or the number of stages in the pipeline). Then the time to execute that sequential operation on a vector of length n is
A startup time is also required; this is the time needed to get the operation going. In a sequential machine, there may some overhead required to set up a loop to repeat the same floating-point operation for an entire vector; the elements of the vector also need to be fetched from memory. If we let Ss be the number of tau time units for the sequential startup time, then Ts must include this time:
As the length of the vector gets larger (as n goes to infinity), the startup time becomes negligible in both cases. This means that
There are a number of other terms to describe the performance of vector processors or vector computers. The following list introduces some of these:
Table 1 provides some performance characteristics for some of the vector computers discussed later in this section. The values of R_infinity and n_1/2 are for the elementwise multiplication of two vectors. These values and those of other vector computers are also discussed in chapter 10 on Computer Performance in [Fosdicketal:1996].
The evolution of vector computers
One of the first supercomputers with built-in vector processors
was the CDC Star 100. The ideas for this machine were first
conceived in 1964 and were based on Iverson's APL programming
language [Iverson:1962]. Lawrence Livermore
National Laboratories contracted to have this machine built for
them in 1967, but the first machine was not delivered until 1974.
By that time, the magnetic core memory contained in the machine
was considered obsolete. The TI-ASC (the Advanced Scientific
Computer by Texas Instruments) was a similar machine on the market
at about the same time.
The CDC Cyber 205 is based on the concepts originated for the CDC Star 100; the first commercial model was delivered in 1981. This supercomputer is a memory-memory vector machine; it fetches vectors directly from memory to fill the pipelines and stores the pipeline results directly to memory; it has no vector registers. As a result, the startup times for the vector pipelines are large, as is reflected in the large value of n_1/2 for the CDC Cyber 205 in table 1. This machine contains up to four general-purpose pipelines, instead of pipelines designed for specific operations. It also provides both gather and scatter operations. A later shared-memory multiprocessor version of the CDC Cyber 205 is the ETA-10.
More recent vector computers include the IBM 3090, the Alliant FX/8 (a shared-memory multiprocessor with 8 CPU's, each with an attached vector processor), and the computers in the Convex series (each of which may also have multiple CPU's). Perhaps the most successful group of vector processors has been the family of Cray vector computers.
The Cray-1, like other Cray vector processors that followed, was a vector-register machine. This type of machine fills the vector pipelines from the vector element values currently in the vector registers. This reduces the time to fill the pipelines (the startup time) for vector arithmetic operations; the vector registers can even be filled while the pipelines are performing some other operation. The vector results may be put back into a vector register after the completion of the operation, or they may be piped directly into another pipeline for an additional vector operation (chaining).
The Cray-1 was the first machine to use chaining. For vector processors that employ chaining techniques, not only is the startup time for each operation smaller than that of a comparable memory-memory machine, but also two floating-point operations may be performed at the same time, thereby doubling the number of Mflops.
Each vector register on the Cray-1 and on most later Cray
vector processors contains 64 single precision elements.
The Cray-1 and later Cray vector processors have twelve different
pipelines or functional units. These are of four main types:
One of the problems with the Cray-1 was that it only allowed one memory
read and write per clock cycle. Vector processors with this
limitation may only read one vector and write one vector result at
the same time. When more than one read (or write) is needed for the
same operation, one of the reads (or the writes) must stall waiting
for the other to complete. Consider the elementwise
multiplication of two vectors of length greater than 64:
The MP portion of the name refers to multiprocessing.
The Cray X-MP is a shared-memory multiprocessor
with each CPU controlling its own set of vector processors.
Originally, this machine could be purchased in
a 2-processor version and later in a 4-processor version.
The Cray X-MP is capable of two reads and two writes per
clock cycle. This allows a faster load and store between the memory
and the vector registers and prevents some pipeline delays. Chaining
on this machine allows all three floating-point pipelines to
operate simultaneously.
In 1989, a new company called Cray Computer Corporation and
headed by Seymour Cray split from Cray Research, Inc. Each
company continued to develop supercomputers using the Cray name. The
Cray Computer Corporation followed the ideas derived for the Cray-2.
The Cray Y-MP/M90 was introduced in 1991. This model is more
like the Cray Y-MP than the Cray C90
but provides the ability to handle larger memory addresses and
has a much larger memory. Partly because of the additional
memory, the Cray Y-MP/M90 operates at about half
the speed of the Cray C90. Originally limited to 8 processors,
the Cray Y-MP/M90 has since been extended to handle up
to sixteen processors.
The Cray Y-MP models are the fastest of the Cray models discussed so far.
The clock cycle time of both the Cray Y-MP and the Cray Y-MP/M90 is 6 nsec.
The Cray C90 has a clock cycle of 4.2 nsec.
Table 2
summarizes the characteristics of three
Cray machines discussed here: the Cray-1, the Cray X-MP, and the
Cray Y-MP/832. Likewise,
table 3
compares the
performance of these three Cray models. Similar tables are provided
in chapter 10 on Computer Performance in [Fosdicketal:1996].
The processors of this machine are completely surrounded by
a liquid for cooling purposes. When
encased in plastic, the appearance of the Cray-3 is similar to a
large aquarium with
the many wires attached to the processors gently swaying with the flow
of the liquid. For this reason, this machine is sometimes called a
goldfish bowl.
CSCI 4576/4586:
Return to Home Page for CSCI 4576/4586
CU CS Home:
Return to CU Boulder Computer Science Dept Home Page
Figure 8
shows how these pipeline groups are distributed.
The number of stages in a pipeline is given in parentheses
in the box representing that pipeline in the diagram. Notice that
a floating-point reciprocal approximation pipeline is used
to implement floating-point division; x/y is computed as
x * (1/y).
The Cray X-MP
The Cray X-MP, first delivered in 1983, is faster than the Cray-1.
The architecture is similar to that of the Cray-1, but there is
more support for overlapped operations along with multiple memory
pipelines. The early models of the Cray X-MP have a faster clock
than the Cray-1: 9.5 nsec versus 12.5 nsec. Later models of the
Cray X-MP have a clock cycle of 8.5 nsec.
The Cray-2
In 1985, the Cray-2 was introduced as a completely redesigned
architecture.
Available as a multiprocessor
with up to four processors, the Cray-2 does not support chaining.
With only one memory pipeline per processor, there is a
greater memory latency cost. Thus the Cray-2 is an improvement on the
Cray X-MP only for those problems that require large amounts of memory.
The Cray Y-MP
Following along the proven lines of the Cray X-MP, Cray Research,
Inc.~announced the Cray Y-MP in 1987 as a faster rendition of the
Cray X-MP with up to eight processors. A later version, called
the Cray C90, was announced in 1990,
allowing up to sixteen processors.
The Cray-3
In the meantime, the group at Cray Computer Corporation continued
work begun before the split to develop the Cray-3 (based on the
Cray-2). This process took about ten years total, and the machine
was never sold commercially. In 1993, one Cray-3 was placed in the
supercomputing center at NCAR for research purposes. It contains
four processors and has a clock cycle time of 2.1 nsec.
University of Colorado,
Boulder, CO 80309-0430
The HPSC Group:
jessup@cs.colorado.edu,