Solving Linear Systems I: Gaussian Elimination

In this lecture we cover Demmel's lecture 14.

Major issues in solving linear systems

  • Reorganize an algorithm to use BLAS3
  • Lay out matrices across processors to balance load and minimize communication
  • Design parallel software to be portable

The six most important problems of linear algebra

  • General system of linear equations A*x = b
  • Symmetric positive definite systems A*x = b
  • Least squares min(norm(A*x - b))
  • Eigenvalue problem A*x = lambda*x
  • Symmetric eigenvalue problem A*x = lambda*x
  • Singular value problem

The selection of an algorithm also depends on whether the system is sparse or dense.

    Distinction among the six problems

  • In theory, the first two problems (linear systems A*x = b and least squares) can be solved exactly in a finite number of operations, O(n^3)
  • The eigenvalue and singular value problems can only be solved approximately, by iteration. An algorithm usually involves two phases:

    - Initial reduction to a condensed form (tridiagonal for the symmetric eigenvalue problem, bidiagonal for the SVD, Hessenberg in the general case). This step is usually difficult to convert to BLAS3.

    - Iterative phase. It is cheaper than the reduction phase in the symmetric case and more expensive in the general case.

Gaussian Elimination (direct method for dense matrices)

    A generic algorithm

    for i = 1 to n-1
        for j = i+1 to n
            A(j,i) = A(j,i)/A(i,i)
        end
        for j = i+1 to n
            for k = i+1 to n
                A(j,k) = A(j,k) - A(j,i)*A(i,k)
            end
        end
    end
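
    As a concrete illustration, the generic algorithm translates directly into Python/NumPy (a minimal sketch: the function name is mine, the matrix is overwritten in place, and nonzero pivots are assumed since there is no pivoting yet):

    import numpy as np

    def ge_generic(A):
        # In-place Gaussian elimination without pivoting; assumes
        # A[i, i] != 0 at every step. Multipliers overwrite the strict
        # lower triangle, U overwrites the upper triangle.
        n = A.shape[0]
        for i in range(n - 1):
            for j in range(i + 1, n):
                A[j, i] = A[j, i] / A[i, i]        # multiplier L(j,i)
            for j in range(i + 1, n):
                for k in range(i + 1, n):
                    A[j, k] -= A[j, i] * A[i, k]   # eliminate column i
        return A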
    
    Vector version

    for i = 1 to n-1
        A(i+1:n,i) = A(i+1:n,i)/A(i,i);          --------------- BLAS1 (scal)
        A(i+1:n,i+1:n) = A(i+1:n,i+1:n) - A(i+1:n,i)*A(i,i+1:n)
                                         --------------- BLAS2 (rank-1 update, ger)
    end
    
    Serial time: (2/3)*n^3 + O(n^2).

    PRAM time: 3*(n-1).
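
    This vector form maps onto NumPy as well; in the sketch below (function name mine), np.outer plays the role of the BLAS2 rank-1 update:

    import numpy as np

    def ge_vector(A):
        # Same factorization as the triple loop above, with the two
        # inner loops replaced by vector and rank-1 operations.
        n = A.shape[0]
        for i in range(n - 1):
            A[i+1:, i] /= A[i, i]                               # BLAS1 (scal)
            A[i+1:, i+1:] -= np.outer(A[i+1:, i], A[i, i+1:])   # BLAS2 (ger)
        return A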

    Gaussian elimination with partial pivoting (GEPP)

    for i = 1 to n-1
        find p(i) such that |A(p(i),i)| = max_{i<=j<=n} |A(j,i)|;
        if |A(p(i),i)| < epsilon, the matrix is (numerically) singular; exit; end
        if p(i) != i, swap A(i,:) and A(p(i),:); end
        A(i+1:n,i) = A(i+1:n,i)/A(i,i);
        A(i+1:n,i+1:n) = A(i+1:n,i+1:n) - A(i+1:n,i)*A(i,i+1:n)
    end
    
    The above algorithm computes the decomposition P*A = L*U, where P is a permutation matrix, L is lower triangular with unit diagonal, and U is upper triangular.
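
    A runnable Python/NumPy sketch of GEPP (the function name and epsilon threshold are my choices); it records the pivots in a permutation vector p so that the factorization can be checked directly:

    import numpy as np

    def gepp(A, eps=1e-12):
        # Returns (p, LU): p is the row permutation, LU holds the
        # multipliers below the diagonal and U on/above it.
        A = A.copy()
        n = A.shape[0]
        p = np.arange(n)
        for i in range(n - 1):
            m = i + np.argmax(np.abs(A[i:, i]))    # pivot row
            if abs(A[m, i]) < eps:
                raise ValueError("matrix is numerically singular")
            if m != i:
                A[[i, m], :] = A[[m, i], :]        # swap rows
                p[[i, m]] = p[[m, i]]
            A[i+1:, i] /= A[i, i]
            A[i+1:, i+1:] -= np.outer(A[i+1:, i], A[i, i+1:])
        return p, A

    # Quick check of P*A = L*U on a random matrix:
    A0 = np.random.rand(5, 5)
    p, LU = gepp(A0)
    L = np.tril(LU, -1) + np.eye(5)
    U = np.triu(LU)
    assert np.allclose(A0[p], L @ U)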

    Reorganize GEPP to use BLAS3. The idea is to delay updating: save up a sequence of BLAS2 operations and apply them all at once with BLAS3. The block column width b (number of columns in a block) should be small enough that a block column fits in fast memory, yet large enough that matrix-matrix multiplication runs fast.

    for ib = 1 to n-1 step b
        end = min(ib+b-1,n);        % last column of the current block
        % LU factorize the panel A(ib:n,ib:end)
        for i = ib to end
            find p(i) such that |A(p(i),i)| = max_{i<=j<=n} |A(j,i)|;
            if |A(p(i),i)| < epsilon, the matrix is (numerically) singular; exit; end
            if p(i) != i, swap A(p(i),:) and A(i,:); end
            A(i+1:n,i) = A(i+1:n,i)/A(i,i);
            % update only inside the block, BLAS2
            A(i+1:n,i+1:end) = A(i+1:n,i+1:end) - A(i+1:n,i)*A(i,i+1:end);
        end
        % let LL = the unit lower triangular part of A(ib:end,ib:end)
        A(ib:end,end+1:n) = LL^{-1}*A(ib:end,end+1:n);   % triangular solve
        % delayed update, BLAS3
        A(end+1:n,end+1:n) = A(end+1:n,end+1:n)
                           - A(end+1:n,ib:end)*A(ib:end,end+1:n)
    end
    
    Observation: As Gaussian elimination proceeds, the trailing submatrix A(end+1:n,end+1:n) that remains to be updated shrinks.
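
    A serial Python/NumPy sketch of this blocked algorithm (not a tuned implementation; the default block size and function names are my choices, and scipy's solve_triangular stands in for the LL^{-1} triangular solve):

    import numpy as np
    from scipy.linalg import solve_triangular

    def gepp_blocked(A, b=32, eps=1e-12):
        A = A.copy()
        n = A.shape[0]
        p = np.arange(n)
        for ib in range(0, n - 1, b):
            e = min(ib + b, n)      # one past the last column of the block
            # Panel factorization of A[ib:n, ib:e] with BLAS2 updates:
            for i in range(ib, e):
                m = i + np.argmax(np.abs(A[i:, i]))
                if abs(A[m, i]) < eps:
                    raise ValueError("matrix is numerically singular")
                if m != i:
                    A[[i, m], :] = A[[m, i], :]   # swap full rows
                    p[[i, m]] = p[[m, i]]
                A[i+1:, i] /= A[i, i]
                A[i+1:, i+1:e] -= np.outer(A[i+1:, i], A[i, i+1:e])
            if e < n:
                # X = LL^{-1} * A[ib:e, e:]: BLAS3 triangular solve
                A[ib:e, e:] = solve_triangular(A[ib:e, ib:e], A[ib:e, e:],
                                               lower=True, unit_diagonal=True)
                # Delayed rank-b update: BLAS3 matrix-matrix multiply
                A[e:, e:] -= A[e:, ib:e] @ A[ib:e, e:]
        return p, A

    The output format matches the unblocked gepp above, so the same P*A = L*U check applies.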

    Column Blocked Layout

  • poor load balance: as the elimination proceeds, the processors owning the leftmost columns fall idle

    Column Cyclic Layout

  • good load balance
  • cannot use BLAS2 to factorize the panel A(ib:n,ib:end)
  • cannot use BLAS3 to update A(end+1:n,end+1:n)

    Column Block Cyclic Layout

  • slightly worse load balance
  • can use BLAS2 and BLAS3
  • factorization of A(ib:n,ib:end) takes place on one processor, a serial bottleneck

    Row and Column (2D) Block Cyclic Layout

    Notation

    (pi,pj): index of a processor, 0 <= pi < prow, 0 <= pj < pcol

    p = prow*pcol: number of processors

    brow: number of rows in a block

    bcol: number of columns in a block

    (pi,:): processor row pi (the pcol processors sharing row index pi)

    (:,pj): processor column pj (the prow processors sharing column index pj)

    Mapping

    The matrix entry A(i,j) (indices counted from 0) is stored on processor (pi,pj), where pi = floor(i/brow) mod prow and pj = floor(j/bcol) mod pcol; equivalently, block (floor(i/brow), floor(j/bcol)) lives on that processor.

    Example. A matrix of order n=16 with 2x2 blocks (brow=bcol=2), hence an 8x8 grid of blocks, distributed over p=4 processors arranged as a 2x2 grid (prow=pcol=2).
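
    A small Python helper (names mine) makes the mapping concrete and reproduces the repeating ownership pattern of the example:

    def owner(i, j, brow, bcol, prow, pcol):
        # Processor (pi, pj) that stores entry A(i, j); 0-based indices.
        return ((i // brow) % prow, (j // bcol) % pcol)

    n, brow, bcol, prow, pcol = 16, 2, 2, 2, 2
    owners = [[owner(i, j, brow, bcol, prow, pcol)
               for j in range(0, n, bcol)]     # one entry per block column
              for i in range(0, n, brow)]      # one row per block row
    # owners[bi][bj] is the (pi, pj) owner of block (bi, bj); the
    # pattern repeats every prow x pcol blocks.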

    Distributed Gaussian Elimination with 2D Block Cyclic Layout

    Assume b = brow = bcol.

  • Step 1. Reduction among prow processors in the current column to find the pivot and broadcast its index.

  • Step 2. Exchange rows in the column and broadcast the pivot row.

  • Step 3. Compute multipliers (local) using BLAS1.

  • Step 4. Update A(ib:n, ib:end) (local) using BLAS2.

  • Step 5. Broadcast pivot information to all other processors in the row.

  • Step 6. Swap rows.

  • Step 7. Broadcast LL to the right along the processor row.

  • Step 8. Solve for X in LL*X = A(ib:end, end+1:n) (a set of lower triangular systems). The solution X overwrites A(ib:end, end+1:n). (local)

  • Step 9. Send blocks down for updating. Assuming tree-based broadcast, the communication time is
    	log2(prow)*(alpha + (b*(n-end)/pcol)*beta)
    
  • Step 10. Send blocks to the right for updating. Assuming ring-based broadcast, the communication time is
    	2*(alpha + (b*(n-end)/prow)*beta)
    
    Note that we only count the left-most processor (one send and one receive); the communication of the others can be overlapped.

  • Step 11. Update A(end+1:n,end+1:n) (local) using BLAS3. The computation time is
    	2*b*(n-end)^2/p
    
    Steps 9 and 10 dominate the communication time and step 11 dominates the computation time. Summed over the n/b block steps, steps 9-11 cost approximately
    	  alpha*(n/b)*(log2(prow) + 2)
    	+ beta*(log2(prow)/(2*pcol) + 1/prow)*n^2
    	+ (2/3)*n^3/p
    
    Accounting as well for the remaining per-column communication (pivot search, swaps, and broadcasts), the total time is approximately
    	  alpha*n*(6 + log2(prow))
    	+ beta*n^2*(2*(prow - 1)/p + (pcol - 1)/p + log2(prow)/(2*pcol))
    	+ (2/3)*n^3/p
    
    The efficiency, i.e. the serial time (2/3)*n^3 divided by p times the total time above, is the reciprocal of
    	  1 + alpha*(1.5*p*(6 + log2(prow))/n^2)
    	+ beta*(1.5/n)*(2*prow + pcol + prow*log2(prow)/2 - 3)
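
    To get a feel for the model, here is a small Python evaluation of the efficiency expression (the alpha and beta values below are purely illustrative; both are assumed to be normalized by the time per flop):

    from math import log2

    def efficiency(n, p, prow, pcol, alpha, beta):
        # Inverse of the reciprocal-efficiency expression above.
        recip = (1.0
                 + alpha * 1.5 * p * (6 + log2(prow)) / n**2
                 + beta * (1.5 / n)
                        * (2*prow + pcol + prow*log2(prow)/2 - 3))
        return 1.0 / recip

    # For fixed p (a 4x4 grid, p = 16), efficiency improves as n grows:
    for n in (1000, 4000, 16000):
        print(n, efficiency(n, p=16, prow=4, pcol=4, alpha=1e4, beta=1e2))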
    
    Remarks

  • Efficiency increases as alpha and beta (the communication costs) decrease. When alpha = beta = 0 the efficiency is 100%, meaning the work is perfectly load balanced.
  • For fixed p, alpha, and beta, efficiency increases as n increases.
  • Efficiency is maintained if n grows at the same rate as sqrt(p), assuming prow ~ pcol ~ sqrt(p).
  • To maximize efficiency, prow should be slightly smaller than sqrt(p): in the beta term, prow carries a larger coefficient than pcol, so a slightly flatter processor grid reduces communication.