Deep pipelining allows short cycle (fast clock rate) causing memory bottleneck. Superscalar demands even faster memory access. The solution is memory hierarchy. This solution is based on the principle of locality. We use a simple model: two layers (fast cache and slow main memory).
BLAS1 (vector): y = a*x + y (saxpy)
flop memory load a ----------------------------------- 1 for i = 1:n load x(i), y(i) ------------------------------ 2*n y(i) = a*x(i) + y(i); -------------- 2*n store y(i) ----------------------------------- n end ----------------------- 2n 3n+1 ratio Q = 2/3
BLAS2 (matrix-vector): y = A*x + y ([ ]gemv)
flop memory load y(1:n) ------------------------------------- n for j = 1:n load A(:,j), x(j) ----------------------- (n+1)*n y(1:n) = A(:,j)*x(j) + y(1:n) ----- 2n*n end store y(1:n) ----------------------------------- n ------------------------ 2n^2 n^2 + 3n ratio Q = 2
Why are we interested in the ratio Q? Let Tf be the time for a floating point operation and Tm for a memory access, then the execution time is:
flop*Tf + memory*Tm = flop*Tf*(1 + (Tm/Tf)/Q)Without memory access, the total time would be flop*Tf. If the ratio Q (flop to memory) is large, then we are close to the ideal case. From the above, we can see that matrix-vector operation (BLAS2) is only little better than vector operation. Actually, the operation in the loop in BLAS2 is really saxpy.
BLAS3 (matrix-matrix): C = A*B + C ([ ]gemm)
For this operation, it is necessary to load A, B, and C, and then store C. That is, the memory access is at least 4*n^2. To compute each entry of C, we need an inner product, which takes 2n flops. Thus the total floating point operation count is 2*n^3. So, it is possible that the ratio Q = n/2. But, we must be careful about implementing matrix-matrix multiplication. The following is a traightforward implementation.
flop memory for j = 1:n load C(:,j) ---------------------------------- n*n for i = 1:n load A(:,i), B(i,j) ---------------------- (n+1)*n*n C(:,j) = A(:,i)*B(i,j) + C(:,j) 2n*n*n end store C(:,j) --------------------------------- n*n end -------------------------- 2n^3 n^3 + 3n^2 ratio Q = 2This is not better than BLAS2. Actually, the inner loop is BLAS2.
for j=1:n read B(:,j) into fast memory read C(:,j) into fast memory for k=1:n read A(:,k) into fast memory C(:,j) = C(:,j) + B(k,j)*A(:,k); end; write back C(:,j) end;The total slow memory access is n^3 + n^2 + 2*n^2 = n^3 + 2*n^2. The fast memory size is at least 3*n (three vectors of size n).
for j=1:N read Bj into fast memory read Cj into fast memory for k=1:n read A(:,k) into fast memory Cj = Cj + A(:,k)*Bj(k,:); end; write back Cj end;The slow memory access is n^2 + 2*n^2 + N*n^2 = (N + 3)*N^2, where N is the number of column blocks. The fast memory size is at least 2*n^2/N. Thus the slow memory access is at least 2*n^4/M + 3*n^2. This means, to get good performance, M must grow at least as fast as n.
[C11 ... C1N] [A11 ... A1N] [B11 ... B1N] C = [ ... ] A = [ ... ] B = [ ... ] [CN1 ... CNN] [AN1 ... ANN] [BN1 ... BNN] for i=1:N for j=1:N read Cij into fast memory for k=1:N read Aik into fast memory read Bkj into fast memory Cij = Cij + Aik*Bkj; end; write back Cij end; end;The slow memory access is N*n^2 + N*n^2 + 2*n^2 = (2N + 2)*n^2. The fast memory size is at least 3*(n/M)^2. Thus the slow memory access is at least 2*n^3*sqrt(3/M) + 2*n^2. This lower bound is achieved when N = n*sqrt(3/M).
function C = MMScalar(A,B) % C = MMScalar(A,B) % Scalar version of matrix-matrix multiplication % Pre % A mxk matrix % B kxn matrix % Post % C mxn matrix A*B [m,k] = size(A); [k,n] = size(B); C = zeros(m,n); for i = 1:n for j = 1:m for p = 1:k C(i,j) = C(i,j) + A(i,p)*B(p,j); end end endThe second is the SAXPY version:
function C = MMSax(A,B) % C = MMSax(A,B) % Saxpy (column) version of matrix-matrix multiplication % Pre % A mxk matrix % B kxn matrix % Post % C mxn matrix A*B [m,k] = size(A); [k,n] = size(B); C = zeros(m,n); for j = 1:n for p = 1:k C(:,j) = C(:,j) + B(p,j)*A(:,p); end endThe third is the blocked submatrix (BLAS3) version:
function C = MMBlas3(A,B,blocks) % C = MMBlas3(A,B,blocks) % Blas3 version of matrix-matrix multiplication % Pre % A nxn matrix, n is multiple of blocks % B nxn matrix, n is multiple of blocks % blocks number of blocks % Post % C nxn matrix A*B [n,n] = size(A); bsize = n/blocks; % block size C = zeros(n,n); for i = 1:blocks ilow = (i-1)*bsize + 1; iup = i*bsize; for j = 1:blocks jlow = (j-1)*bsize + 1; jup = j*bsize; for p = 1:blocks plow = (p-1)*bsize + 1; pup = p*bsize; C(ilow:iup, jlow:jup) = C(ilow:iup, jlow:jup) + ... A(ilow:iup, plow:pup)*B(plow:pup, jlow:jup); end end endThe following benchmark program was run on an IBM Thinkpad with a 600 MHz Pentium III. The matrix size is 100-by-100. In the third version, the number of blocks is 4.
% Script file MMBench.m % % Performance comparison of MMScalar, MMSax, and MMGax disp(' ') n = input('Enter matrix size: ') blocks = input('Enter number of blocks: ') disp(sprintf('n = %d', n)) disp(sprintf('blocks = %d', blocks)) A = rand(n,n); B = rand(n,n); disp(' ') disp('Algorithm Time Flops') disp('----------------------------------------') flops(0); tic; C = MMScalar(A,B); t = toc; disp(sprintf('MMScalar %5.4f %d', t, flops)) flops(0); tic; C = MMSax(A,B); t = toc; disp(sprintf('MMSax %5.4f %d', t, flops)) flops(0); tic; C = MMBlas3(A,B, blocks); t = toc; disp(sprintf('MMBlas3 %5.4f %d', t, flops))The output is:
Algorithm Time Flops ---------------------------------------- MMScalar 19.390 2000018 MMSax 0.440 2000018 MMBlas3 0.050 2050639
[C11 C12] = [A11 A12]*[B11 B12] [C21 C22] [A21 A22] [B21 B22]Strassen's formulae are
P1 = (A11 + A22)*(B11 + B22) P2 = (A21 + A22)*B11 P3 = A11*(B12 - B22) P4 = A22*(B21 - B11) P5 = (A11 + A12)*B22 P6 = (A21 - A11)*(B11 + B12) P7 = (A12 - A22)*(B21 + B22) C11 = P1 + P4 - P5 + P7 C12 = P3 + P5 C21 = P2 + P4 C22 = P1 + P3 - P2 + P6We can apply the same procedure recursively on the multiplications associated with the Pi. Since the algorithm performs seven recursive calls on matrices of size n/2, and 18 additions of matrices of size n/2. Denote T(n) the floating point operations, we have the following recursion on the operation count T(n) = 7*T(n/2) + 18*(n/2)^2. It can be verified that it has a solution T(n) = O(n^(log_2(7))) = O(n^(2.8)). Actually, it can be shown that the complexity is about 4n^(2.8). So, it not only has a lower exponent but also a reasonable coefficient. The value of Strassen's algorithm is not just its low asymptotic complexity but its reduction of the problem to smaller problems. So, the recursion terminates when the matrices are of size small enough to fit in fast memory. Some recent experience of various researchers has shown that Strassen's algorithm is of practical interest, even for n in hundreds.
|C - Cc| <= (n*u/(1-n*u))*|A|*|B|where u is the unit roundoff, which is about 10^(-7) in single precision and 10^(-16) in double precision. The normwise forward error bound is
||C - Cc|| <= (n*u/(1-n*u))*||A||*||B||Some simple matrix norms are:
Frobenius norm: ||A||_F = sqrt(sum_ij |Aij|^2) 1-norm (max column sum): ||A||_1 = max_j(sum_i |Aij|) infty-norm (max row sum): ||A||_infty = max_i(sum_j |Aij|) <\pre>Summary
To get high performance, we must consider pipelining and locality. It is a challenge to juggle all these issues. Thus we develop tools (BLAS).High-level Parallelism
A generic parallel computer: A cluster of processors with possible local memories are connected by an interconnection network.A taxonomy of parallel architectures
From address space point of view. In shared address space machines processors communicate via reading and writing shared memory. Communication occurs implicitly as a result of conventional memory access instructions (loads and stores) Based on assess time, we have UMA (uniform memory access) and NUMA (nonuniform memory access). UMA leads to an ideal PRAM model. This model is widely used in theoretical research. It gives a lower bound. Example of commercial shared memory machine is IBM SMP. The characteristics of shared memory machines are: easy to program, program model is similar to multiprogramming; fast communication; expensive for large number (>32) of processors.
In distributed memory machines, processors, each with a local memory, are connected by an interconnection network. Processors communicate via passing messages. Example of commercial message-passing machine is IBM SP-2. The characteristics of distributed memory machines are: harder to program (message passing); slow communication; less expensive for large number of processors.
From control point of view. In SIMD (single instruction multiple data) machines, all processors execute same instructions on multiple data. Examples of commercial SIMD machines: Cray T90 (single processor vector machine), CM-2 (Thinking Machines).
In MIMD (multiple instruction multiple data) machines, each processor has its local control. Almost all commercial parallel machines are MIMD.
From programming point of view. Data parallelism. This is originated from SIMD. We partition data among processors. For a simple example, array operations (operating on components in parallel). In data parallelism, if communication is required (for example, global sum) then communication is implicit. If synchronization is required (for example, between two statements), then synchronization is implicit. Recent standard is HPF (high performance Fortran). We can use MATLAB to prototype vector operations. For example,
C(1:n) = A(1:n) + B(1:n)is a vector operation andfor i=1:n C(i) = A(i) + B(i); endis a scalar version.Message passing. (distributed memory) Basic operations are send, recv, etc. For example, PVM (parallel virtual machine) and MPI (message passing interface). Usually, we use SPMD (single program multiple data). Here is the basic structure:
if (myRank == 0) ... else ... end;Message passing programming is difficult and error-prone.Thread programming (shared address space). A process can have multiple threads and access multiple processors. For example, Solaris thread programming on multiprocessor Sparc. Procedures: thr_create, thr_join, mutex_lock, mutex_unlock, cond_wait, cond_signal, cond_broadcast.
From architecture point of view Dataflow architecture
Systolic architecture>