Tutorial: General Architecture of Vector Processors
 
L. D. Fosdick, C. J. C. Schauble, and E. R. Jessup

This work has been supported by the National Science Foundation
under an Educational Infrastructure grant, CDA-9017953 and
developed as part of the High Performance Scientific Computing
(HPSC) project at the University of Colorado at Boulder.

Copyright © 1999 by the HPSC Group of the University of Colo rado

UNDER CONSTRUCTION! Currently missing images, imagemaps, tables, figures, and applets.

Watch this space!


A vector computer or vector processor is a machine designed to efficiently handle arithmetic operations on elements of arrays, called vectors. Such machines are especially useful in high-performance scientific computing, where matrix and vector arithmetic are quite common. The Cray Y-MP and the Convex C3880 are two examples of vector processors used today.

This tutorial provides a general overview of the architecture of a vector computer. This includes an introduction to vectors and vector arithmetic, a discussion of performance measurements used to evaluate this type of machine, and a comparison of the characteristics of particular vector computers. A brief history of vector processors is provided as well, with a focus on the Cray vector architectures.

Vectors and vector arithmetic
To understand the concepts behind a vector processor, we first present a short review of vectors and vector arithmetic.
Vector computing architectural concepts
We continue by showing the application of these ideas to the hardware in vector processors.
Vector computing performance
We then discuss performance and performance metrics, providing these figures for a few specific vector processors.
The evolution of vector computers
Finally we relate the history of some vector processors. In particular, we focus on Cray vector processors.


Vectors and vector arithmetic

A vector, v, is a list of elements

v = ( v1, v2, v3, ..., vn ),

transposed. The length of a vector is defined as the number of elements in that vector; so the length of v is n. When mapping a vector to a computer program, we declare the vector as an array of one dimension. In Fortran, we declare v by the statement
       DIMENSION  V(N)
where N is an integer variable holding the value of the length of the vector. Throughout this tutorial, we use the following terms almost interchangeably: vector, array, list.

Arithmetic operations may be performed on vectors. Two vectors are added by adding corresponding elements:

s = x + y = ( x1+y1, x2+y2, ..., xn+yn ).

In Fortran, vector addition could be performed by the following code
       DO  I=1,N
          S(I) = X(I) + Y(I)
       ENDDO
where s is the vector representing the final sum and S, X, and Y have been declared as arrays of dimension N. This operation is sometimes called elementwise addition. Similarly, the subtraction of two vectors, x - y, is an elementwise operation.


Vector computing architectural concepts

A vector computer contains a set of special arithmetic units called pipelines. These pipelines overlap the execution of the different parts of an arithmetic operation on the elements of the vector, producing a more efficient execution of the arithmetic operation. In many respects, a pipeline is similar to an assembly line in a factory where different steps of the assembly of an automobile, for example, are performed at different stages of the line.

In this section, we discuss how a vector pipeline operates, the advantages of this type of architecture, and other architectural features found in vector processors.

The stages of a floating-point operation

Consider the steps or stages involved in a floating-point addition on a sequential machine with IEEE arithmetic hardware: s = x + y.

Figure 1 shows the step-by-step example of such an addition. The numbers to be added are x = 1234.00 and y = -567.8. In deference to the human reader, these are represented in decimal notation with a mantissa of four digits.

Now consider this scalar addition performed on all the elements of a pair of vectors (arrays) of length n. Each of the six stages needs to be executed for every pair of elements. If each stage of the execution takes tau units of time, then each addition takes 6*tau units of time (not counting the time required to fetch and decode the instruction itself or to fetch the two operands). So the number of time units required to add all the elements of the two vectors in a serial fashion would be Ts = 6*n*tau. These execution stages are shown in figure 2 with respect to time.

An arithmetic pipeline

Suppose the addition operation described in the last subsection is pipelined; that is, one of the six stages of the addition for a pair of elements is performed at each stage in the pipeline. Each stage of the pipeline has a separate arithmetic unit designed for the operation to be performed at that stage. Once stage A has been completed for the first pair of elements, these elements can be moved to the next stage (B) while the second pair of elements moves into the first stage (A). Again each stage takes tau units of time. Thus, the flow through the pipeline can be viewed as shown in figure 3, where the stages of the pipeline addition execute with respect to time as in figure 4. (Compare figure 2 to figure 4.)

Observe that it still takes 6*tau units of time to complete the sum of the first pair of elements, but that the sum of the next pair is ready in only tau more units of time. And this pattern continues for each succeeding pair. This means that the time, Tp, to do the pipelined addition of two vectors of length n is

Tp = 6*tau + (n-1)*tau = (n + 5)*tau.

The first 6*tau units of time are required to fill the pipeline and to obtain the first result. After the last result, xn + yn, is completed, the pipeline is emptied out or flushed.

Comparing the equations for Ts and Tp, it is clear that

(n + 5)*tau < 6*n*tau, for n > 1.

Thus, this pipelined version of addition is faster than the serial version by almost a factor of the number of stages in the pipeline. This is an example of what makes vector processing more efficient than scalar processing. For large n, the pipelined addition for this sample pipeline is about six times faster than scalar addition.

In this discussion, we have assumed that the floating-point addition requires six stages and takes 6*tau units of time. There is nothing magic about this number 6; in fact, for some architectures, the number of stages in a floating-point addition may be more or less than six. Further, the individual stages may be quite different from the ones listed in section on pipelined addition. The operations at each stage of a pipeline for floating-point multiplication are slightly different than those for addition; a multiplication pipeline may even have a different number of stages than an addition pipeline. There may also be pipelines for integer operations. As shown in figure 8, pipelines to perform vector operations on the Cray-1 have from one to fourteen stages, depending on the type of operation performed by the pipeline.

Vector registers

Some vector computers, such as the Cray Y-MP, contain vector registers. A general purpose or a floating-point register holds a single value; vector registers contain several elements of a vector at one time. For example, the Cray Y-MP vector registers contain 64 elements while the Cray C90 vector registers hold 128 elements. The contents of these registers may be sent to (or received from) a vector pipeline one element at a time.

Scalar registers

Scalar registers behave like general purpose or floating-point registers; they hold a single value. However, these registers are configured so that they may be used by a vector pipeline; the value in the register is read once every tau units of time and put into the pipeline, just as a vector element is released from the vector pipeline. This allows the elements of a vector to be operated on by a scalar. To compute

y = 2.5 * x,

the 2.5 is stored in a scalar register and fed into the vector multiplication pipeline every tau units of time in order to be multiplied by each element of x to produce y.

Chaining

Figure 4 is a diagram of a single pipeline. As mentioned in section on pipelined addition, most vector architectures have more than one pipeline; they may also contain different types of pipelines.

Some vector architectures provide greater efficiency by allowing the output of one pipeline to be chained directly into another pipeline. This feature is called chaining and eliminates the need to store the result of the first pipeline before sending it into the second pipeline. Figure 5 demonstrates the use of chaining in the computation of a saxpy vector operation:

a*x + y,

where x and y are vectors and a is a scalar constant.

Chaining can double the number of floating-point operations that are done in tau units of time. Once both the multiplication and addition pipelines have been filled, one floating-point multiplication and one floating-point addition (a total of two floating-point operations) are completed every tau time units. Conceptually, it is possible to chain more than two functional units together, providing an even greater speedup. However this is rarely (if ever) done due to difficult timing problems.

Scatter and gather operations

Sometimes, only certain elements of a vector are needed in a computation. Most vector processors are equipped to pick out the appropriate elements (a gather operation) and put them together into a vector or a vector register. If the elements to be used are in a regularly-spaced pattern, the spacing between the elements to be gathered is called the stride. For example, if the elements

x1, x5, x9, x13, ..., x[4*floor((n-1)/4)+1]

are to be extracted from the vector

( x1, x2, x3, x4, x5, x6, ..., xn )

for some vector operation, we say the stride is equal to 4. A scatter operation reformats the output vector so that the elements are spaced correctly. Scatter and gather operations may also be used with irregularly-spaced data.

Vector-register vector processors

If a vector processor contains vector registers, the elements of the vector are read from memory directly into the vector register by a load vector operation. The vector result of a vector operation is put into a vector register before it is stored back in memory by a store vector operation; this permits it to be used in another computation without needing to be reread, and it allows the store to be overlapped by other operations. On these machines, all arithmetic or logical vector operations are register-register operations; that is, they are only performed on vectors that are already in the vector registers. For this reason, these machines are called vector-register vector processors.

Memory-memory vector processors

Another type of vector processor allows the vector operands to be fetched directly from memory to the different vector pipelines and the results to be written directly to memory; these are called memory-memory vector processors. Because the elements of the vector need to come from memory instead of a register, it takes a little longer to get a vector operation started; this is due partly to the cost of a memory access. One example of a memory-memory vector processor is the CDC Cyber 205.

Because of the ability to overlap memory accesses and the possible reuse of vector processors, vector-register vector processors are usually more efficient than memory-memory vector processors. However as the length of the vectors in a computation increase, this difference in efficiency between the two types of architectures is diminished. In fact, the memory-memory vector processors may prove more efficient if the vectors are long enough. Nevertheless, experience has shown that shorter vectors are more commonly used.

Interleaved memory banks

To allow faster access to vector elements stored in memory, the memory of a vector processor is often divided into memory banks. Interleaved memory banks associate successive memory addresses with successive banks cyclically; thus word 0 is stored in bank 0, word 1 is in bank 1, ..., word n-1 is in bank n-1, word n is in bank 0, word n+1 is in bank 1, ..., etc., where n is the number of memory banks. As with many other computer architectural features, n is usually a power of 2:

n = 2^k,

where k = 1, 2, 3, or 4.

One memory access (load or store) of a data value in a memory bank takes several clock cycles to complete. Each memory bank allows only one data value to be read or stored in a single memory access, but more than one memory bank may be accessed at the same time. When the elements of a vector stored in an interleaved memory are read into a vector register, the reads are staggered across the memory banks so that one vector element is read from a bank per clock cycle. If one memory access takes n clock cycles, then n elements of a vector may be fetched at a cost of one memory access; this is n times faster than the same number of memory accesses to a single bank.


Vector computing performance

For typical vector architectures, the value of tau (the time to complete one pipeline stage) is equivalent to one clock cycle of the machineOn some machines, it may be equal to two or more clock cycles.. Once a pipeline like the one shown in figure 3 has been filled, it generates one result for each tau units of time, that is, for each clock cycle. This means the hardware performs one floating-point operation per clock cycle.

Let k represent the number of tau time units the same sequential operation would take (or the number of stages in the pipeline). Then the time to execute that sequential operation on a vector of length n is

Ts = k*n*tau,

and the time to perform the pipelined version is

Tp = k*tau + (n-1)*tau = (n + k - 1)*tau.

Again for n > 1, Ts > Tp.

A startup time is also required; this is the time needed to get the operation going. In a sequential machine, there may some overhead required to set up a loop to repeat the same floating-point operation for an entire vector; the elements of the vector also need to be fetched from memory. If we let Ss be the number of tau time units for the sequential startup time, then Ts must include this time:

Ts = (Ss + k*n)*tau.

In a pipelined machine, the flow from the vector registers or from memory to the pipeline needs to be started; call this time quantity Sp. Another overhead cost, k*tau time units, is the time needed to initially fill the pipeline. Hence, Tp must include the startup time for the pipelined operation; thus,

Tp = (Sp + k)*tau + (n - 1)*tau

or

Tp = (Sp + k + n - 1)*tau.

As the length of the vector gets larger (as n goes to infinity), the startup time becomes negligible in both cases. This means that

Ts --> k*n*tau

while

Tp --> n*tau.

Thus, for large n, Ts is k times larger than Tp.

There are a number of other terms to describe the performance of vector processors or vector computers. The following list introduces some of these:

Figures 6 and 7 show the relationship between these terms for a vector machine with tau = 6 nanoseconds and Sp = 16*tau . This is an idealized picture; there are slight drops in the rate curve when n is equal to a multiple of the vector-register length.

Table 1 provides some performance characteristics for some of the vector computers discussed later in this section. The values of R_infinity and n_1/2 are for the elementwise multiplication of two vectors. These values and those of other vector computers are also discussed in chapter 10 on Computer Performance in [Fosdicketal:1996].


The evolution of vector computers

One of the first supercomputers with built-in vector processors was the CDC Star 100. The ideas for this machine were first conceived in 1964 and were based on Iverson's APL programming language [Iverson:1962]. Lawrence Livermore National Laboratories contracted to have this machine built for them in 1967, but the first machine was not delivered until 1974. By that time, the magnetic core memory contained in the machine was considered obsolete. The TI-ASC (the Advanced Scientific Computer by Texas Instruments) was a similar machine on the market at about the same time.

The CDC Cyber 205 is based on the concepts originated for the CDC Star 100; the first commercial model was delivered in 1981. This supercomputer is a memory-memory vector machine; it fetches vectors directly from memory to fill the pipelines and stores the pipeline results directly to memory; it has no vector registers. As a result, the startup times for the vector pipelines are large, as is reflected in the large value of n_1/2 for the CDC Cyber 205 in table 1. This machine contains up to four general-purpose pipelines, instead of pipelines designed for specific operations. It also provides both gather and scatter operations. A later shared-memory multiprocessor version of the CDC Cyber 205 is the ETA-10.

More recent vector computers include the IBM 3090, the Alliant FX/8 (a shared-memory multiprocessor with 8 CPU's, each with an attached vector processor), and the computers in the Convex series (each of which may also have multiple CPU's). Perhaps the most successful group of vector processors has been the family of Cray vector computers.

The Cray-1

The first Cray computer, brought out in 1976, was called the Cray-1. It was designed for supercomputing with pipelined vector arithmetic units. Besides being a vector computer, it was the fastest scalar machine at the time that it was introduced. Like the CDC Cyber 205, the Cray-1 provided scatter and gather operations; furthermore, it allowed non-unit strides through vectors using these operations.

The Cray-1, like other Cray vector processors that followed, was a vector-register machine. This type of machine fills the vector pipelines from the vector element values currently in the vector registers. This reduces the time to fill the pipelines (the startup time) for vector arithmetic operations; the vector registers can even be filled while the pipelines are performing some other operation. The vector results may be put back into a vector register after the completion of the operation, or they may be piped directly into another pipeline for an additional vector operation (chaining).

The Cray-1 was the first machine to use chaining. For vector processors that employ chaining techniques, not only is the startup time for each operation smaller than that of a comparable memory-memory machine, but also two floating-point operations may be performed at the same time, thereby doubling the number of Mflops.

Each vector register on the Cray-1 and on most later Cray vector processors contains 64 single precision elements.An exception is the Cray C90 with vector registers containing 128 elements. Each single precision element or word contains 64 bits; this is the equivalent of double precision values on most other machines. The Cray vector processors do not use standard IEEE floating-point representation.

The Cray-1 and later Cray vector processors have twelve different pipelines or functional units. These are of four main types:

Figure 8 shows how these pipeline groups are distributed. The number of stages in a pipeline is given in parentheses in the box representing that pipeline in the diagram. Notice that a floating-point reciprocal approximation pipeline is used to implement floating-point division; x/y is computed as x * (1/y).

One of the problems with the Cray-1 was that it only allowed one memory read and write per clock cycle. Vector processors with this limitation may only read one vector and write one vector result at the same time. When more than one read (or write) is needed for the same operation, one of the reads (or the writes) must stall waiting for the other to complete. Consider the elementwise multiplication of two vectors of length greater than 64:

s = x * y,

where s1 = x1 * y1, s2 = x2 * y2, etc. To begin, two vector registers are filled from memory. One contains the first 64 elements of x; the other, the first 64 elements of y. As these vectors are pushed through the pipeline, the results for s go into a third vector register and from there are stored in memory. After the first 64 elements of each vector have been processed, both input vector registers must be refilled from memory and the result vector register must be written to memory. Since only one read and one write can be done per cycle, elements of both input vectors cannot be read in at the same time; so the pipeline has to delay waiting for one of the read operations to finish. This accounts for the drops in the rate curve mentioned earlier.

The Cray X-MP

The Cray X-MP, first delivered in 1983, is faster than the Cray-1. The architecture is similar to that of the Cray-1, but there is more support for overlapped operations along with multiple memory pipelines. The early models of the Cray X-MP have a faster clock than the Cray-1: 9.5 nsec versus 12.5 nsec. Later models of the Cray X-MP have a clock cycle of 8.5 nsec.

The MP portion of the name refers to multiprocessing. The Cray X-MP is a shared-memory multiprocessor with each CPU controlling its own set of vector processors. Originally, this machine could be purchased in a 2-processor version and later in a 4-processor version.

The Cray X-MP is capable of two reads and two writes per clock cycle. This allows a faster load and store between the memory and the vector registers and prevents some pipeline delays. Chaining on this machine allows all three floating-point pipelines to operate simultaneously.

The Cray-2

In 1985, the Cray-2 was introduced as a completely redesigned architecture. Available as a multiprocessor with up to four processors, the Cray-2 does not support chaining. With only one memory pipeline per processor, there is a greater memory latency cost. Thus the Cray-2 is an improvement on the Cray X-MP only for those problems that require large amounts of memory.

In 1989, a new company called Cray Computer Corporation and headed by Seymour Cray split from Cray Research, Inc. Each company continued to develop supercomputers using the Cray name. The Cray Computer Corporation followed the ideas derived for the Cray-2.

The Cray Y-MP

Following along the proven lines of the Cray X-MP, Cray Research, Inc.~announced the Cray Y-MP in 1987 as a faster rendition of the Cray X-MP with up to eight processors. A later version, called the Cray C90, was announced in 1990, allowing up to sixteen processors.

The Cray Y-MP/M90 was introduced in 1991. This model is more like the Cray Y-MP than the Cray C90 but provides the ability to handle larger memory addresses and has a much larger memory. Partly because of the additional memory, the Cray Y-MP/M90 operates at about half the speed of the Cray C90. Originally limited to 8 processors, the Cray Y-MP/M90 has since been extended to handle up to sixteen processors.

The Cray Y-MP models are the fastest of the Cray models discussed so far. The clock cycle time of both the Cray Y-MP and the Cray Y-MP/M90 is 6 nsec. The Cray C90 has a clock cycle of 4.2 nsec.

Table 2 summarizes the characteristics of three Cray machines discussed here: the Cray-1, the Cray X-MP, and the Cray Y-MP/832. Likewise, table 3 compares the performance of these three Cray models. Similar tables are provided in chapter 10 on Computer Performance in [Fosdicketal:1996].

The Cray-3

In the meantime, the group at Cray Computer Corporation continued work begun before the split to develop the Cray-3 (based on the Cray-2). This process took about ten years total, and the machine was never sold commercially. In 1993, one Cray-3 was placed in the supercomputing center at NCAR for research purposes. It contains four processors and has a clock cycle time of 2.1 nsec.

The processors of this machine are completely surrounded by a liquid for cooling purposes. When encased in plastic, the appearance of the Cray-3 is similar to a large aquarium with the many wires attached to the processors gently swaying with the flow of the liquid. For this reason, this machine is sometimes called a goldfish bowl.


Bibliography


CSCI 4576/4586: Return to Home Page for CSCI 4576/4586

CU CS Home: Return to CU Boulder Computer Science Dept Home Page


Dept. of Computer Science,
University of Colorado,
Boulder, CO 80309-0430
The HPSC Group: jessup@cs.colorado.edu,
schauble@cs.colorado.edu,
lloyd@cs.colorado.edu