Parallelism

References

  • PCA, 1.1
  • Demmel, Lecture 2, Lecture 3
  • IPC, 2.1, 2.2
Low-level Parallelism

  • Bit-level parallelism: Up to about 1986, bit-level parallelism dominated the advances, as 4-bit microprocessors were replaced by 8-bit, 16-bit, and wider designs. Once the 32-bit word length was reached in the mid-1980s, this trend slowed. A wider word increases bit-level parallelism and expands the address space.
  • Instruction level parallelism
  • Pipelining: The execution of an instruction is divided into multiple stages, so that in a single cycle several instructions are in progress at different stages. Pipelining increases throughput but does not decrease the execution time of an individual instruction.
  • Superscalar: Deep pipelining allows a short cycle (a fast clock rate), which makes memory the bottleneck. In a superscalar processor, the hardware can issue multiple instructions in the same cycle.
  • Multiprocessor level parallelism: The early multiprocessor systems were introduced by small companies. They combined 10 to 20 microprocessors to compete for a share of the minicomputer market. However, RISC microprocessors sapped the CISC multiprocessor momentum in the late 1980s. Shortly thereafter, several large companies began producing RISC multiprocessor systems, especially as servers and mainframe replacements. The early 1990s brought a dramatic advance in the shared memory bus technology, including faster electrical signaling, wider datapaths, pipelined protocols, and multiple paths. This increase in bandwidth allowed the multiprocessor designs to ramp back up to the 10-to-20 range and beyond.
  • Parallelism and locality

    Deep pipelining allows a short cycle (a fast clock rate), which causes a memory bottleneck, and superscalar issue demands even faster memory access. The solution is a memory hierarchy, based on the principle of locality. We use a simple model with two layers: a fast cache and a slow main memory.

    Basic Linear Algebra Subroutines

    BLAS is a library of subroutines for common operations, such as matrix-vector multiplication and matrix-matrix multiplication, on high performance machines. Let us look at the memory access patterns and floating point operation counts.

    BLAS1 (vector): y = a*x + y (saxpy)

                                                 flop     memory
    
        load a         -----------------------------------   1
        for i = 1:n
            load x(i), y(i) ------------------------------ 2*n
            y(i) = a*x(i) + y(i); -------------- 2*n
            store y(i) -----------------------------------   n
        end
                                            -----------------------
                                                  2n       3n+1
        ratio Q = 2n/(3n+1), about 2/3
    

    BLAS2 (matrix-vector): y = A*x + y ([ ]gemv)

                                                 flop     memory
    
        load y(1:n)  -------------------------------------   n
        for j = 1:n
            load A(:,j), x(j)  ----------------------- (n+1)*n
            y(1:n) = A(:,j)*x(j) + y(1:n) -----  2n*n
        end
        store y(1:n)  -----------------------------------    n
    
                                            ------------------------
                                                 2n^2     n^2 + 3n
        ratio Q = 2n^2/(n^2+3n), about 2
    

    Why are we interested in the ratio Q? Let Tf be the time for a floating point operation and Tm the time for a memory access; then the execution time is:

        flop*Tf + memory*Tm = flop*Tf*(1 + (Tm/Tf)/Q)
    
    Without memory access, the total time would be flop*Tf. If the ratio Q = flop/memory is large, then we are close to this ideal case. From the above we can see that the matrix-vector operation (BLAS2) is only a little better than the vector operation (BLAS1). In fact, the operation inside the loop of BLAS2 is itself a saxpy.
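    To make the model concrete, the following MATLAB fragment evaluates the predicted slowdown factor 1 + (Tm/Tf)/Q for the three BLAS levels. This is only a sketch; the ratio Tm/Tf = 10 and the size n = 1000 are assumed values for illustration, not measurements.

        % Sketch of the time model: time = flop*Tf*(1 + (Tm/Tf)/Q).
        % Tm/Tf = 10 is an assumed illustrative value, not a measurement.
        TmOverTf = 10;
        n = 1000;
        Q = [2/3, 2, n/2];              % BLAS1 (saxpy), BLAS2 (gemv), BLAS3 (gemm, ideal)
        slowdown = 1 + TmOverTf./Q;     % factor over the flop-only time flop*Tf
        disp(sprintf('BLAS1: %.1f   BLAS2: %.1f   BLAS3: %.2f', slowdown))

    With these assumptions, BLAS1 runs about 16 times slower than the flop-only time and BLAS2 about 6 times, while BLAS3 stays within a few percent of it.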

    BLAS3 (matrix-matrix): C = A*B + C ([ ]gemm)

    For this operation, it is necessary to load A, B, and C, and then store C, so the memory access is at least 4*n^2. To compute each entry of C, we need an inner product, which takes 2n flops, so the total floating point operation count is 2*n^3. Thus the ratio can be as large as Q = 2*n^3/(4*n^2) = n/2. But we must be careful about implementing matrix-matrix multiplication. The following is a straightforward implementation.

                                                 flop       memory
        for j = 1:n
            load C(:,j)  ----------------------------------  n*n
            for i = 1:n
                load A(:,i), B(i,j)  ----------------------  (n+1)*n*n
                C(:,j) = A(:,i)*B(i,j) + C(:,j)  2n*n*n
            end
            store C(:,j)  ---------------------------------  n*n
        end
                                              --------------------------
                                                 2n^3       n^3 + 3n^2
        ratio Q = 2n^3/(n^3+3n^2), about 2
    
    This is no better than BLAS2; in fact, the inner loop is exactly a matrix-vector product (BLAS2).

    Fast Matrix-Matrix Multiplication

    Now, we show how to multiply matrices fast. All modern computers have hierarchical memory structure. The key idea is to reduce slow memory access by utilizing the fast memory. For simplicity, we consider matrix-matrix multiplication: C = A*B, and assume all A, B, and C are of order n and the size of the fast memory is M (M << n^2 and M > 4n).
  • unblocked version (column version).
    	for j=1:n
    	    read B(:,j) into fast memory
    	    read C(:,j) into fast memory
    	    for k=1:n
    		read A(:,k) into fast memory
    		C(:,j) = C(:,j) + B(k,j)*A(:,k);
    	    end;
    	    write back C(:,j)
    	end;
    
    The total slow memory access is n^3 + n^2 + 2*n^2 = n^3 + 3*n^2. The fast memory must hold at least 3*n words (three vectors of size n).
  • column blocked version. Partition C = [C1, ..., CN] and B = [B1, ..., BN]
    	for j=1:N
    	    read Bj into fast memory
    	    read Cj into fast memory
    	    for k=1:n
    		read A(:,k) into fast memory
    		Cj = Cj + A(:,k)*Bj(k,:);
    	    end;
    	    write back Cj
    	end;
    
    The slow memory access is n^2 + 2*n^2 + N*n^2 = (N + 3)*n^2, where N is the number of column blocks. The fast memory must hold at least 2*n^2/N words (the blocks Bj and Cj), so N >= 2*n^2/M and the slow memory access is at least 2*n^4/M + 3*n^2. This means that, to get good performance, M must grow at least as fast as n.
  • submatrix blocked version. Partition
          [C11 ... C1N]        [A11 ... A1N]        [B11 ... B1N]
      C = [    ...    ]    A = [    ...    ]    B = [    ...    ]
          [CN1 ... CNN]	   [AN1 ... ANN]	[BN1 ... BNN]
    
    	for i=1:N
    	   for j=1:N
    		read Cij into fast memory
    		for k=1:N
    		    read Aik into fast memory
    		    read Bkj into fast memory
    		    Cij = Cij + Aik*Bkj;
    		end;
    		write back Cij
    	    end;
    	end;
    
    The slow memory access is N*n^2 + N*n^2 + 2*n^2 = (2N + 2)*n^2. The fast memory must hold at least 3*(n/N)^2 words (one block each of A, B, and C), so N >= n*sqrt(3/M) and the slow memory access is at least 2*n^3*sqrt(3/M) + 2*n^2. This lower bound is achieved when N = n*sqrt(3/M).
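    As a rough illustration, the MATLAB fragment below picks the smallest block count N allowed by the fast memory and reports the resulting ratio Q. This is only a sketch; the matrix order n = 4096 and the fast memory size M (a 256 KB cache holding 8-byte words) are assumed values, not from the text.

        % Choosing the block count N from an assumed fast memory size M.
        n = 4096;                      % matrix order (assumed)
        M = 256*1024/8;                % assumed: 256 KB fast memory, 8-byte words
        N = ceil(n*sqrt(3/M));         % smallest N with 3*(n/N)^2 <= M
        traffic = (2*N + 2)*n^2;       % slow memory accesses of the blocked algorithm
        Q = 2*n^3/traffic;             % flops per slow memory access
        disp(sprintf('N = %d blocks, Q = %.0f (versus Q = 2 for the unblocked version)', N, Q))

    With these assumptions N = 40 and Q is about 100, compared with Q of about 2 for the unblocked and column blocked versions.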
  • Example. Three versions of matrix-matrix multiplication were programmed in Matlab. First is the naive scalar version:
    
    function C = MMScalar(A,B)
    % C = MMScalar(A,B)
    % Scalar version of matrix-matrix multiplication
    % Pre
    %     A     mxk matrix
    %     B     kxn matrix
    % Post
    %     C     mxn matrix A*B
    
    [m,k] = size(A);
    [k,n] = size(B);
    C = zeros(m,n);
    
     for i = 1:m
        for j = 1:n
          for p = 1:k
             C(i,j) = C(i,j) + A(i,p)*B(p,j);
          end
       end
    end
    
    
    The second is the SAXPY version:
    
    function C = MMSax(A,B)
    % C = MMSax(A,B)
    % Saxpy (column) version of matrix-matrix multiplication
    % Pre
    %    A    mxk matrix
    %    B    kxn matrix
    % Post
    %    C    mxn matrix A*B
    
    [m,k] = size(A);
    [k,n] = size(B);
    C = zeros(m,n);
    
    for j = 1:n
       for p = 1:k
          C(:,j) = C(:,j) + B(p,j)*A(:,p);
       end
    end
    
    
    The third is the blocked submatrix (BLAS3) version:
    
    function C = MMBlas3(A,B,blocks)
    % C = MMBlas3(A,B,blocks)
    % Blas3 version of matrix-matrix multiplication
    % Pre
    %    A    nxn matrix, n is multiple of blocks
    %    B    nxn matrix, n is multiple of blocks
    %    blocks  number of blocks
    % Post
    %    C    nxn matrix A*B
    
     n = size(A,1);            % order of the matrices
    bsize = n/blocks;         % block size
    C = zeros(n,n);
    
    for i = 1:blocks
       ilow = (i-1)*bsize + 1; iup = i*bsize;
       for j = 1:blocks
          jlow = (j-1)*bsize + 1; jup = j*bsize;
          for p = 1:blocks
             plow = (p-1)*bsize + 1; pup = p*bsize;
             C(ilow:iup, jlow:jup) = C(ilow:iup, jlow:jup) + ...
                A(ilow:iup, plow:pup)*B(plow:pup, jlow:jup);
          end
       end
    end
    
    
    The following benchmark program was run on an IBM Thinkpad with a 600 MHz Pentium III. The matrix size is 100-by-100. In the third version, the number of blocks is 4.
    
    % Script file MMBench.m
    %
     % Performance comparison of MMScalar, MMSax, and MMBlas3
    
    disp(' ')
    n = input('Enter matrix size: ')
    blocks = input('Enter number of blocks: ')
    disp(sprintf('n = %d', n))
    disp(sprintf('blocks = %d', blocks))
    
    A = rand(n,n);
    B = rand(n,n);
    
    disp(' ')
    disp('Algorithm          Time           Flops')
    disp('----------------------------------------')
    
    flops(0);
    tic;
    C = MMScalar(A,B);
    t = toc;
    disp(sprintf('MMScalar         %5.4f         %d', t, flops))
    
    flops(0);
    tic;
    C = MMSax(A,B);
    t = toc;
    disp(sprintf('MMSax             %5.4f         %d', t, flops))
    
    flops(0);
    tic;
    C = MMBlas3(A,B, blocks);
    t = toc;
    disp(sprintf('MMBlas3           %5.4f         %d', t, flops))
    
    
    The output is:
    
    Algorithm          Time           Flops
    ----------------------------------------
    MMScalar          19.390        2000018
    MMSax              0.440        2000018
    MMBlas3            0.050        2050639
    
    
  • Strassen algorithm. Another approach to fast matrix multiplication is to reduce the number of floating point operations. Strassen's algorithm for matrix multiplication requires O(n^(2.8)) operations. Assuming that A and B are n-by-n and n is a power of 2, we partition A, B, and C into 2-by-2 block matrices with equally sized blocks:
    
            [C11  C12] = [A11  A12]*[B11  B12]
            [C21  C22]   [A21  A22] [B21  B22]
    
    
    Strassen's formulae are
    
            P1 = (A11 + A22)*(B11 + B22)
            P2 = (A21 + A22)*B11
            P3 = A11*(B12 - B22)
            P4 = A22*(B21 - B11)
            P5 = (A11 + A12)*B22
            P6 = (A21 - A11)*(B11 + B12)
            P7 = (A12 - A22)*(B21 + B22)
           C11 = P1 + P4 - P5 + P7
           C12 = P3 + P5
           C21 = P2 + P4
           C22 = P1 + P3 - P2 + P6
    
    
    We can apply the same procedure recursively to the multiplications associated with the Pi. The algorithm performs seven recursive multiplications of matrices of size n/2 and 18 additions of matrices of size n/2. Denoting by T(n) the number of floating point operations, we have the recurrence T(n) = 7*T(n/2) + 18*(n/2)^2, whose solution is T(n) = O(n^(log_2(7))) = O(n^(2.8)). In fact, it can be shown that the operation count is about 4n^(2.8), so the algorithm not only has a lower exponent but also a reasonable constant. The value of Strassen's algorithm lies not just in its low asymptotic complexity but also in its reduction of the problem to smaller problems, so the recursion is terminated when the matrices are small enough to fit in fast memory. The experience of various researchers has shown that Strassen's algorithm is of practical interest even for n in the hundreds. A minimal MATLAB sketch of the recursion is given below.
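    The following function is a minimal sketch of this recursion; the function name strass and the cutoff parameter nmin are our own choices, not part of the text. Below the cutoff it falls back to ordinary multiplication.

        function C = strass(A, B, nmin)
        % C = strass(A,B,nmin)
        % Strassen multiplication of n-by-n matrices, n a power of 2.
        % The recursion stops when n <= nmin; then the ordinary product is used.

        n = size(A,1);
        if n <= nmin
           C = A*B;                               % small problem: ordinary product
        else
           m = n/2;  i = 1:m;  j = m+1:n;         % index sets of the 2-by-2 blocks
           P1 = strass(A(i,i)+A(j,j), B(i,i)+B(j,j), nmin);
           P2 = strass(A(j,i)+A(j,j), B(i,i),        nmin);
           P3 = strass(A(i,i),        B(i,j)-B(j,j), nmin);
           P4 = strass(A(j,j),        B(j,i)-B(i,i), nmin);
           P5 = strass(A(i,i)+A(i,j), B(j,j),        nmin);
           P6 = strass(A(j,i)-A(i,i), B(i,i)+B(i,j), nmin);
           P7 = strass(A(i,j)-A(j,j), B(j,i)+B(j,j), nmin);
           C = [P1+P4-P5+P7, P3+P5;  P2+P4, P1+P3-P2+P6];
        end

    For example, C = strass(A, B, 64) multiplies two 1024-by-1024 matrices with four levels of recursion; choosing nmin so that a few nmin-by-nmin blocks fit in fast memory combines Strassen's operation count with the locality of the blocked algorithm.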
  • Error Bound

    Let Cc denote the computed product C = A*B; then it can be shown that the elementwise forward error bound is
          |C - Cc| <= (n*u/(1-n*u))*|A|*|B|
    
    where u is the unit roundoff, about 10^(-7) in single precision and 10^(-16) in double precision, and the absolute values and the inequality are taken entrywise. The corresponding normwise forward error bound (for the 1-, infinity-, and Frobenius norms) is
          ||C - Cc|| <= (n*u/(1-n*u))*||A||*||B||
    
    Some simple matrix norms are:
    Frobenius norm:
    	||A||_F = sqrt(sum_ij |Aij|^2)
    1-norm (max column sum):
    	||A||_1 = max_j(sum_i |Aij|)
    infty-norm (max row sum):
    	||A||_infty = max_i(sum_j |Aij|)
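    As a rough numerical check of the elementwise bound, the following MATLAB sketch multiplies two random single precision matrices and compares the error against (n*u/(1-n*u))*|A|*|B|, using the double precision product as the reference. The size n = 100 and the use of rand are our own choices for illustration.

        % Sketch: check the elementwise error bound for a single precision product.
        n  = 100;
        As = single(rand(n,n));  Bs = single(rand(n,n));
        Cc = As*Bs;                            % product computed in single precision
        C  = double(As)*double(Bs);            % same product in double, used as reference
        u  = eps('single')/2;                  % unit roundoff, about 6e-8
        gam = n*u/(1 - n*u);
        viol = max(max(abs(C - double(Cc)) - gam*abs(double(As))*abs(double(Bs))));
        disp(sprintf('largest bound violation: %g (negative means the bound holds)', viol))

    In practice the bound is very pessimistic, so the observed error is usually far below it.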
    
    

    Summary

    To get high performance, we must consider pipelining and locality. It is a challenge to juggle all these issues, which is why we develop tools such as the BLAS.

    High-level Parallelism

    A generic parallel computer: a cluster of processors, each possibly with a local memory, connected by an interconnection network.

    A taxonomy of parallel architectures

  • From address space point of view.

    In shared address space machines, processors communicate by reading and writing shared memory. Communication occurs implicitly as a result of conventional memory access instructions (loads and stores). Based on access time, we distinguish UMA (uniform memory access) and NUMA (nonuniform memory access). UMA leads to the idealized PRAM model, which is widely used in theoretical research and gives a lower bound. An example of a commercial shared-memory machine is the IBM SMP. The characteristics of shared-memory machines are: easy to program (the programming model is similar to multiprogramming); fast communication; expensive for a large number (>32) of processors.

    In distributed memory machines, processors, each with a local memory, are connected by an interconnection network and communicate by passing messages. An example of a commercial message-passing machine is the IBM SP-2. The characteristics of distributed-memory machines are: harder to program (message passing); slower communication; less expensive for a large number of processors.

  • From control point of view.

    In SIMD (single instruction multiple data) machines, all processors execute the same instruction on multiple data. Examples of commercial SIMD machines: Cray T90 (a single-processor vector machine), CM-2 (Thinking Machines).

    In MIMD (multiple instruction multiple data) machines, each processor has its local control. Almost all commercial parallel machines are MIMD.

  • From programming point of view.

    Data parallelism. This model originated from SIMD. We partition the data among processors; a simple example is array operations, which operate on the components in parallel. If communication is required (for example, a global sum), it is implicit; if synchronization is required (for example, between two statements), it is also implicit. A recent standard is HPF (High Performance Fortran). We can use MATLAB to prototype vector operations. For example,

    	C(1:n) = A(1:n) + B(1:n)
    
    is a vector operation and
    	for i=1:n
    	    C(i) = A(i) + B(i);
    	end
    
    is a scalar version.

    Message passing (distributed memory). The basic operations are send, recv, etc.; examples are PVM (parallel virtual machine) and MPI (message passing interface). Usually, we use the SPMD (single program multiple data) style. Here is the basic structure:

    	if (myRank == 0)
    	    ...
    	else
    	    ...
    	end;
    
    Message passing programming is difficult and error-prone.

    Thread programming (shared address space). A process can have multiple threads and use multiple processors. An example is Solaris thread programming on a multiprocessor Sparc, with procedures such as thr_create, thr_join, mutex_lock, mutex_unlock, cond_wait, cond_signal, and cond_broadcast.

  • From architecture point of view.

    Dataflow architecture

    Systolic architecture