Parallelism

References

  • PCA, 1.1
  • Demmel, Lecture 2, Lecture 3
  • IPC, 2.1, 2.2
Low-level Parallelism

  • Bit-level parallelism: Up to about 1986, bit-level parallelism dominated the advances, as 4-bit microprocessors were replaced by 8-bit, 16-bit, and wider designs. Once the 32-bit word length was reached in the mid-1980s, this trend slowed. A wider word increases bit-level parallelism and expands the address space.
  • Instruction level parallelism
  • Pipelining: The execution of an instruction is divided into multiple stages, so that in a single cycle several instructions are in progress at different stages. Pipelining increases throughput but does not decrease the execution time of an individual instruction.
  • Superscalar: Deep pipelining allows a short cycle (a fast clock rate), which makes memory the bottleneck. In a superscalar processor, the hardware can issue multiple instructions in the same cycle.
  • Multiprocessor level parallelism: The early multiprocessor systems were introduced by small companies. They combined 10 to 20 microprocessors to compete for a share of the minicomputer market. However, RISC microprocessors sapped the CISC multiprocessor momentum in the late 1980s. Shortly thereafter, several large companies began producing RISC multiprocessor systems, especially as servers and mainframe replacements. The early 1990s brought a dramatic advance in the shared memory bus technology, including faster electrical signaling, wider datapaths, pipelined protocols, and multiple paths. This increase in bandwidth allowed the multiprocessor designs to ramp back up to the 10-to-20 range and beyond.
  • Parallelism and locality

    Deep pipelining allows a short cycle (a fast clock rate), which causes a memory bottleneck, and superscalar issue demands even faster memory access. The solution is a memory hierarchy, based on the principle of locality. We use a simple model with two layers: a fast cache and a slow main memory.

    Basic Linear Algebra Subroutines

    BLAS is a library of subroutines for common operations, such as matrix-vector multiplication and matrix-matrix multiplication, on high performance machines. Let us look at the memory access patterns and floating point operation counts.

    BLAS1 (vector): y = a*x + y (saxpy)

                                                 flop     memory
    
        load a         -----------------------------------   1
        for i = 1:n
            load x(i), y(i) ------------------------------ 2*n
            y(i) = a*x(i) + y(i); -------------- 2*n
            store y(i) -----------------------------------   n
        end
                                            -----------------------
                                                  2n       3n+1
        ratio Q = 2n/(3n+1), about 2/3
    

    BLAS2 (matrix-vector): y = A*x + y ([ ]gemv)

                                                 flop     memory
    
        load y(1:n)  -------------------------------------   n
        for j = 1:n
            load A(:,j), x(j)  ----------------------- (n+1)*n
            y(1:n) = A(:,j)*x(j) + y(1:n) -----  2n*n
        end
        store y(1:n)  -----------------------------------    n
    
                                            ------------------------
                                                 2n^2     n^2 + 3n
        ratio Q = 2n^2/(n^2+3n), about 2
    

    Why are we interested in the ratio Q? Let Tf be the time for a floating point operation and Tm the time for a memory access; then the execution time is:

        flop*Tf + memory*Tm = flop*Tf*(1 + (Tm/Tf)/Q)
    
    Without memory access, the total time would be flop*Tf. If the ratio Q = flop/memory is large, then we are close to this ideal case. From the above we can see that the matrix-vector operation (BLAS2) is only a little better than the vector operation (BLAS1). In fact, the operation inside the loop of BLAS2 is itself a saxpy.
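    To make the model concrete, the following MATLAB fragment evaluates the predicted slowdown factor 1 + (Tm/Tf)/Q for the three BLAS levels. This is only a sketch; the ratio Tm/Tf = 10 and the size n = 1000 are assumed values for illustration, not measurements.

        % Sketch of the time model: time = flop*Tf*(1 + (Tm/Tf)/Q).
        % Tm/Tf = 10 is an assumed illustrative value, not a measurement.
        TmOverTf = 10;
        n = 1000;
        Q = [2/3, 2, n/2];              % BLAS1 (saxpy), BLAS2 (gemv), BLAS3 (gemm, ideal)
        slowdown = 1 + TmOverTf./Q;     % factor over the flop-only time flop*Tf
        disp(sprintf('BLAS1: %.1f   BLAS2: %.1f   BLAS3: %.2f', slowdown))

    With these assumptions, BLAS1 runs about 16 times slower than the flop-only time and BLAS2 about 6 times, while BLAS3 stays within a few percent of it.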

    BLAS3 (matrix-matrix): C = A*B + C ([ ]gemm)

    For this operation, it is necessary to load A, B, and C, and then store C, so the memory access is at least 4*n^2. To compute each entry of C, we need an inner product, which takes 2n flops, so the total floating point operation count is 2*n^3. Thus the ratio can be as large as Q = 2*n^3/(4*n^2) = n/2. But we must be careful about implementing matrix-matrix multiplication. The following is a straightforward implementation.

                                                 flop       memory
        for j = 1:n
            load C(:,j)  ----------------------------------  n*n
            for i = 1:n
                load A(:,i), B(i,j)  ----------------------  (n+1)*n*n
                C(:,j) = A(:,i)*B(i,j) + C(:,j)  2n*n*n
            end
            store C(:,j)  ---------------------------------  n*n
        end
                                              --------------------------
                                                 2n^3       n^3 + 3n^2
        ratio Q = 2n^3/(n^3+3n^2), about 2
    
    This is no better than BLAS2; in fact, the inner loop is exactly a matrix-vector product (BLAS2).

    Fast Matrix-Matrix Multiplication

    Now, we show how to multiply matrices fast. All modern computers have hierarchical memory structure. The key idea is to reduce slow memory access by utilizing the fast memory. For simplicity, we consider matrix-matrix multiplication: C = A*B, and assume all A, B, and C are of order n and the size of the fast memory is M (M << n^2 and M > 4n).
  • unblocked version (column version).
    	for j=1:n
    	    read B(:,j) into fast memory
    	    read C(:,j) into fast memory
    	    for k=1:n
    		read A(:,k) into fast memory
    		C(:,j) = C(:,j) + B(k,j)*A(:,k);
    	    end;
    	    write back C(:,j)
    	end;
    
    The total slow memory access is n^3 + n^2 + 2*n^2 = n^3 + 3*n^2. The fast memory must hold at least 3*n words (three vectors of size n).
  • column blocked version. Partition C = [C1, ..., CN] and B = [B1, ..., BN]
    	for j=1:N
    	    read Bj into fast memory
    	    read Cj into fast memory
    	    for k=1:n
    		read A(:,k) into fast memory
    		Cj = Cj + A(:,k)*Bj(k,:);
    	    end;
    	    write back Cj
    	end;
    
    The slow memory access is n^2 + 2*n^2 + N*n^2 = (N + 3)*n^2, where N is the number of column blocks. The fast memory must hold at least 2*n^2/N words (the blocks Bj and Cj), so N >= 2*n^2/M and the slow memory access is at least 2*n^4/M + 3*n^2. This means that, to get good performance, M must grow at least as fast as n.
  • submatrix blocked version. Partition
          [C11 ... C1N]        [A11 ... A1N]        [B11 ... B1N]
      C = [    ...    ]    A = [    ...    ]    B = [    ...    ]
          [CN1 ... CNN]	   [AN1 ... ANN]	[BN1 ... BNN]
    
    	for i=1:N
    	   for j=1:N
    		read Cij into fast memory
    		for k=1:N
    		    read Aik into fast memory
    		    read Bkj into fast memory
    		    Cij = Cij + Aik*Bkj;
    		end;
    		write back Cij
    	    end;
    	end;
    
    The slow memory access is N*n^2 + N*n^2 + 2*n^2 = (2N + 2)*n^2. The fast memory must hold at least 3*(n/N)^2 words (one block each of A, B, and C), so N >= n*sqrt(3/M) and the slow memory access is at least 2*n^3*sqrt(3/M) + 2*n^2. This lower bound is achieved when N = n*sqrt(3/M).
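    As a rough illustration, the MATLAB fragment below picks the smallest block count N allowed by the fast memory and reports the resulting ratio Q. This is only a sketch; the matrix order n = 4096 and the fast memory size M (a 256 KB cache holding 8-byte words) are assumed values, not from the text.

        % Choosing the block count N from an assumed fast memory size M.
        n = 4096;                      % matrix order (assumed)
        M = 256*1024/8;                % assumed: 256 KB fast memory, 8-byte words
        N = ceil(n*sqrt(3/M));         % smallest N with 3*(n/N)^2 <= M
        traffic = (2*N + 2)*n^2;       % slow memory accesses of the blocked algorithm
        Q = 2*n^3/traffic;             % flops per slow memory access
        disp(sprintf('N = %d blocks, Q = %.0f (versus Q = 2 for the unblocked version)', N, Q))

    With these assumptions N = 40 and Q is about 100, compared with Q of about 2 for the unblocked and column blocked versions.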
  • Example. Three versions of matrix-matrix multiplication were programmed in Matlab. First is the naive scalar version:
    
    function C = MMScalar(A,B)
    % C = MMScalar(A,B)
    % Scalar version of matrix-matrix multiplication
    % Pre
    %     A     mxk matrix
    %     B     kxn matrix
    % Post
    %     C     mxn matrix A*B
    
    [m,k] = size(A);
    [k,n] = size(B);
    C = zeros(m,n);
    
     for i = 1:m
        for j = 1:n
          for p = 1:k
             C(i,j) = C(i,j) + A(i,p)*B(p,j);
          end
       end
    end
    
    
    The second is the SAXPY version:
    
    function C = MMSax(A,B)
    % C = MMSax(A,B)
    % Saxpy (column) version of matrix-matrix multiplication
    % Pre
    %    A    mxk matrix
    %    B    kxn matrix
    % Post
    %    C    mxn matrix A*B
    
    [m,k] = size(A);
    [k,n] = size(B);
    C = zeros(m,n);
    
    for j = 1:n
       for p = 1:k
          C(:,j) = C(:,j) + B(p,j)*A(:,p);
       end
    end
    
    
    The third is the blocked submatrix (BLAS3) version:
    
    function C = MMBlas3(A,B,blocks)
    % C = MMBlas3(A,B,blocks)
    % Blas3 version of matrix-matrix multiplication
    % Pre
    %    A    nxn matrix, n is multiple of blocks
    %    B    nxn matrix, n is multiple of blocks
    %    blocks  number of blocks
    % Post
    %    C    nxn matrix A*B
    
     n = size(A,1);            % order of the matrices
    bsize = n/blocks;         % block size
    C = zeros(n,n);
    
    for i = 1:blocks
       ilow = (i-1)*bsize + 1; iup = i*bsize;
       for j = 1:blocks
          jlow = (j-1)*bsize + 1; jup = j*bsize;
          for p = 1:blocks
             plow = (p-1)*bsize + 1; pup = p*bsize;
             C(ilow:iup, jlow:jup) = C(ilow:iup, jlow:jup) + ...
                A(ilow:iup, plow:pup)*B(plow:pup, jlow:jup);
          end
       end
    end
    
    
    The following benchmark program was run on an IBM Thinkpad with a 600 MHz Pentium III. The matrix size is 100-by-100. In the third version, the number of blocks is 4.
    
    % Script file MMBench.m
    %
     % Performance comparison of MMScalar, MMSax, and MMBlas3
    
    disp(' ')
    n = input('Enter matrix size: ')
    blocks = input('Enter number of blocks: ')
    disp(sprintf('n = %d', n))
    disp(sprintf('blocks = %d', blocks))
    
    A = rand(n,n);
    B = rand(n,n);
    
    disp(' ')
    disp('Algorithm          Time           Flops')
    disp('----------------------------------------')
    
    flops(0);
    tic;
    C = MMScalar(A,B);
    t = toc;
    disp(sprintf('MMScalar         %5.4f         %d', t, flops))
    
    flops(0);
    tic;
    C = MMSax(A,B);
    t = toc;
    disp(sprintf('MMSax             %5.4f         %d', t, flops))
    
    flops(0);
    tic;
    C = MMBlas3(A,B, blocks);
    t = toc;
    disp(sprintf('MMBlas3           %5.4f         %d', t, flops))
    
    
    The output is:
    
    Algorithm          Time           Flops
    ----------------------------------------
    MMScalar          19.390        2000018
    MMSax              0.440        2000018
    MMBlas3            0.050        2050639
    
    
  • Strassen algorithm. Another approach to fast matrix multiplication is to reduce the number of floating point operations. Strassen's algorithm for matrix multiplication requires O(n^(2.8)) operations. Assuming that A and B are n-by-n and n is a power of 2, we partition A, B, and C into 2-by-2 block matrices with equally sized blocks:
    
            [C11  C12] = [A11  A12]*[B11  B12]
            [C21  C22]   [A21  A22] [B21  B22]
    
    
    Strassen's formulae are
    
            P1 = (A11 + A22)*(B11 + B22)
            P2 = (A21 + A22)*B11
            P3 = A11*(B12 - B22)
            P4 = A22*(B21 - B11)
            P5 = (A11 + A12)*B22
            P6 = (A21 - A11)*(B11 + B12)
            P7 = (A12 - A22)*(B21 + B22)
           C11 = P1 + P4 - P5 + P7
           C12 = P3 + P5
           C21 = P2 + P4
           C22 = P1 + P3 - P2 + P6
    
    
    We can apply the same procedure recursively to the multiplications associated with the Pi. The algorithm performs seven recursive multiplications of matrices of size n/2 and 18 additions of matrices of size n/2. Denoting by T(n) the number of floating point operations, we have the recurrence T(n) = 7*T(n/2) + 18*(n/2)^2, whose solution is T(n) = O(n^(log_2(7))) = O(n^(2.8)). In fact, it can be shown that the operation count is about 4n^(2.8), so the algorithm not only has a lower exponent but also a reasonable constant. The value of Strassen's algorithm lies not just in its low asymptotic complexity but also in its reduction of the problem to smaller problems, so the recursion is terminated when the matrices are small enough to fit in fast memory. The experience of various researchers has shown that Strassen's algorithm is of practical interest even for n in the hundreds. A minimal MATLAB sketch of the recursion is given below.
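    The following function is a minimal sketch of this recursion; the function name strass and the cutoff parameter nmin are our own choices, not part of the text. Below the cutoff it falls back to ordinary multiplication.

        function C = strass(A, B, nmin)
        % C = strass(A,B,nmin)
        % Strassen multiplication of n-by-n matrices, n a power of 2.
        % The recursion stops when n <= nmin; then the ordinary product is used.

        n = size(A,1);
        if n <= nmin
           C = A*B;                               % small problem: ordinary product
        else
           m = n/2;  i = 1:m;  j = m+1:n;         % index sets of the 2-by-2 blocks
           P1 = strass(A(i,i)+A(j,j), B(i,i)+B(j,j), nmin);
           P2 = strass(A(j,i)+A(j,j), B(i,i),        nmin);
           P3 = strass(A(i,i),        B(i,j)-B(j,j), nmin);
           P4 = strass(A(j,j),        B(j,i)-B(i,i), nmin);
           P5 = strass(A(i,i)+A(i,j), B(j,j),        nmin);
           P6 = strass(A(j,i)-A(i,i), B(i,i)+B(i,j), nmin);
           P7 = strass(A(i,j)-A(j,j), B(j,i)+B(j,j), nmin);
           C = [P1+P4-P5+P7, P3+P5;  P2+P4, P1+P3-P2+P6];
        end

    For example, C = strass(A, B, 64) multiplies two 1024-by-1024 matrices with four levels of recursion; choosing nmin so that a few nmin-by-nmin blocks fit in fast memory combines Strassen's operation count with the locality of the blocked algorithm.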
  • Error Bound

    Let Cc denote the computed product C = A*B; then it can be shown that the elementwise forward error bound is
          |C - Cc| <= (n*u/(1-n*u))*|A|*|B|
    
    where u is the unit roundoff, about 10^(-7) in single precision and 10^(-16) in double precision, and the absolute values and the inequality are taken entrywise. The corresponding normwise forward error bound (for the 1-, infinity-, and Frobenius norms) is
          ||C - Cc|| <= (n*u/(1-n*u))*||A||*||B||
    
    Some simple matrix norms are:
    Frobenius norm:
    	||A||_F = sqrt(sum_ij |Aij|^2)
    1-norm (max column sum):
    	||A||_1 = max_j(sum_i |Aij|)
    infty-norm (max row sum):
    	||A||_infty = max_i(sum_j |Aij|)
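    As a rough numerical check of the elementwise bound, the following MATLAB sketch multiplies two random single precision matrices and compares the error against (n*u/(1-n*u))*|A|*|B|, using the double precision product as the reference. The size n = 100 and the use of rand are our own choices for illustration.

        % Sketch: check the elementwise error bound for a single precision product.
        n  = 100;
        As = single(rand(n,n));  Bs = single(rand(n,n));
        Cc = As*Bs;                            % product computed in single precision
        C  = double(As)*double(Bs);            % same product in double, used as reference
        u  = eps('single')/2;                  % unit roundoff, about 6e-8
        gam = n*u/(1 - n*u);
        viol = max(max(abs(C - double(Cc)) - gam*abs(double(As))*abs(double(Bs))));
        disp(sprintf('largest bound violation: %g (negative means the bound holds)', viol))

    In practice the bound is very pessimistic, so the observed error is usually far below it.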
    
    

    Summary

    To get high performance, we must consider pipelining and locality. It is a challenge to juggle all these issues, which is why we develop tools such as the BLAS.

    High-level Parallelism

    A generic parallel computer: a cluster of processors, each possibly with a local memory, connected by an interconnection network.

    A taxonomy of parallel architectures

  • From address space point of view.

    In shared address space machines, processors communicate by reading and writing shared memory. Communication occurs implicitly as a result of conventional memory access instructions (loads and stores). Based on access time, we distinguish UMA (uniform memory access) and NUMA (nonuniform memory access). UMA leads to the idealized PRAM model, which is widely used in theoretical research and gives a lower bound. An example of a commercial shared-memory machine is the IBM SMP. The characteristics of shared-memory machines are: easy to program (the programming model is similar to multiprogramming); fast communication; expensive for a large number (>32) of processors.

    In distributed memory machines, processors, each with a local memory, are connected by an interconnection network and communicate by passing messages. An example of a commercial message-passing machine is the IBM SP-2. The characteristics of distributed-memory machines are: harder to program (message passing); slower communication; less expensive for a large number of processors.

  • From control point of view.

    In SIMD (single instruction multiple data) machines, all processors execute the same instruction on multiple data. Examples of commercial SIMD machines: Cray T90 (a single-processor vector machine), CM-2 (Thinking Machines).

    In MIMD (multiple instruction multiple data) machines, each processor has its local control. Almost all commercial parallel machines are MIMD.

  • From programming point of view.

    Data parallelism. This model originated from SIMD. We partition the data among processors; a simple example is array operations, which operate on the components in parallel. If communication is required (for example, a global sum), it is implicit; if synchronization is required (for example, between two statements), it is also implicit. A recent standard is HPF (High Performance Fortran). We can use MATLAB to prototype vector operations. For example,

    	C(1:n) = A(1:n) + B(1:n)
    
    is a vector operation and
    	for i=1:n
    	    C(i) = A(i) + B(i);
    	end
    
    is a scalar version.

    Message passing (distributed memory). The basic operations are send, recv, etc.; examples are PVM (parallel virtual machine) and MPI (message passing interface). Usually, we use the SPMD (single program multiple data) style. Here is the basic structure:

    	if (myRank == 0)
    	    ...
    	else
    	    ...
    	end;
    
    Message passing programming is difficult and error-prone.

    Thread programming (shared address space). A process can have multiple threads and use multiple processors. An example is Solaris thread programming on a multiprocessor Sparc, with procedures such as thr_create, thr_join, mutex_lock, mutex_unlock, cond_wait, cond_signal, and cond_broadcast.

  • From architecture point of view.

    Dataflow architecture

    Systolic architecture