Solving Linear Systems I: Gaussian Elimination

In this lecture we cover Demmel's lecture 14.

Major issues in solving linear systems

  • Reorganize an algorithm to use BLAS3
  • Lay out matrices across processors to balance load and minimize communication
  • Design parallel software to be portable

The six most important problems of linear algebra

  • General system of linear equations A*x = b
  • Symmetric positive definite systems A*x = b
  • Least squares min(norm(A*x - b))
  • Eigenvalue problem A*x = lambda*x
  • Symmetric eigenvalue problem A*x = lambda*x
  • Singular value problem

The selection of an algorithm also depends on whether the system is sparse or dense.

    Distinction among the six problems

  • In theory, the first two problems (linear systems A*x = b and least squares) can be solved exactly in a finite number of operations, O(n^3)
  • The eigenvalue and singular value problems can only be solved approximately, by iteration. An algorithm usually involves two phases:

    - Initial reduction to a condensed form (tridiagonal for the symmetric eigenvalue problem, bidiagonal for the SVD, Hessenberg in the general case). This step is usually difficult to convert to BLAS3.

    - Iterative phase. It is cheaper than the reduction phase in the symmetric case and more expensive in the general case.

Gaussian Elimination (direct method for dense matrices)

    A generic algorithm

    for i = 1 to n-1
        for j = i+1 to n
            A(j,i) = A(j,i)/A(i,i)
        end
        for j = i+1 to n
            for k = i+1 to n
                A(j,k) = A(j,k) - A(j,i)*A(i,k)
            end
        end
    end
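
    As a concrete illustration, the generic algorithm translates directly into Python/NumPy (a minimal sketch: the function name is mine, the matrix is overwritten in place, and nonzero pivots are assumed since there is no pivoting yet):

    import numpy as np

    def ge_generic(A):
        # In-place Gaussian elimination without pivoting; assumes
        # A[i, i] != 0 at every step. Multipliers overwrite the strict
        # lower triangle, U overwrites the upper triangle.
        n = A.shape[0]
        for i in range(n - 1):
            for j in range(i + 1, n):
                A[j, i] = A[j, i] / A[i, i]        # multiplier L(j,i)
            for j in range(i + 1, n):
                for k in range(i + 1, n):
                    A[j, k] -= A[j, i] * A[i, k]   # eliminate column i
        return A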
    
    Vector version

    for i = 1 to n-1
        A(i+1:n,i) = A(i+1:n,i)/A(i,i);          --------------- BLAS1 (scal)
        A(i+1:n,i+1:n) = A(i+1:n,i+1:n) - A(i+1:n,i)*A(i,i+1:n)
                                         --------------- BLAS2 (rank-1 update, ger)
    end
    
    Serial time: (2/3)*n^3 + O(n^2).

    PRAM time: 3*(n-1).
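
    This vector form maps onto NumPy as well; in the sketch below (function name mine), np.outer plays the role of the BLAS2 rank-1 update:

    import numpy as np

    def ge_vector(A):
        # Same factorization as the triple loop above, with the two
        # inner loops replaced by vector and rank-1 operations.
        n = A.shape[0]
        for i in range(n - 1):
            A[i+1:, i] /= A[i, i]                               # BLAS1 (scal)
            A[i+1:, i+1:] -= np.outer(A[i+1:, i], A[i, i+1:])   # BLAS2 (ger)
        return A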

    Gaussian elimination with partial pivoting (GEPP)

    for i = 1 to n-1
        find p(i) such that |A(p(i),i)| = max_{i<=j<=n} |A(j,i)|;
        if |A(p(i),i)| < epsilon, the matrix is (numerically) singular; exit; end
        if p(i) != i, swap A(i,:) and A(p(i),:); end
        A(i+1:n,i) = A(i+1:n,i)/A(i,i);
        A(i+1:n,i+1:n) = A(i+1:n,i+1:n) - A(i+1:n,i)*A(i,i+1:n)
    end
    
    The above algorithm computes the decomposition P*A = L*U, where P is a permutation matrix, L is lower triangular with unit diagonal, and U is upper triangular.
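
    A runnable Python/NumPy sketch of GEPP (the function name and epsilon threshold are my choices); it records the pivots in a permutation vector p so that the factorization can be checked directly:

    import numpy as np

    def gepp(A, eps=1e-12):
        # Returns (p, LU): p is the row permutation, LU holds the
        # multipliers below the diagonal and U on/above it.
        A = A.copy()
        n = A.shape[0]
        p = np.arange(n)
        for i in range(n - 1):
            m = i + np.argmax(np.abs(A[i:, i]))    # pivot row
            if abs(A[m, i]) < eps:
                raise ValueError("matrix is numerically singular")
            if m != i:
                A[[i, m], :] = A[[m, i], :]        # swap rows
                p[[i, m]] = p[[m, i]]
            A[i+1:, i] /= A[i, i]
            A[i+1:, i+1:] -= np.outer(A[i+1:, i], A[i, i+1:])
        return p, A

    # Quick check of P*A = L*U on a random matrix:
    A0 = np.random.rand(5, 5)
    p, LU = gepp(A0)
    L = np.tril(LU, -1) + np.eye(5)
    U = np.triu(LU)
    assert np.allclose(A0[p], L @ U)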

    Reorganize GEPP to use BLAS3. The idea is to delay updating: save up a sequence of BLAS2 operations and apply them all at once with BLAS3. The block column width b (number of columns in a block) should be small enough that a block column fits in fast memory, yet large enough that matrix-matrix multiplication runs fast.

    for ib = 1 to n-1 step b
        end = min(ib+b-1,n);        % last column of the current block
        % LU factorize the panel A(ib:n,ib:end)
        for i = ib to end
            find p(i) such that |A(p(i),i)| = max_{i<=j<=n} |A(j,i)|;
            if |A(p(i),i)| < epsilon, the matrix is (numerically) singular; exit; end
            if p(i) != i, swap A(p(i),:) and A(i,:); end
            A(i+1:n,i) = A(i+1:n,i)/A(i,i);
            % update only inside the block, BLAS2
            A(i+1:n,i+1:end) = A(i+1:n,i+1:end) - A(i+1:n,i)*A(i,i+1:end);
        end
        % let LL = the unit lower triangular part of A(ib:end,ib:end)
        A(ib:end,end+1:n) = LL^{-1}*A(ib:end,end+1:n);   % triangular solve
        % delayed update, BLAS3
        A(end+1:n,end+1:n) = A(end+1:n,end+1:n)
                           - A(end+1:n,ib:end)*A(ib:end,end+1:n)
    end
    
    Observation: As Gaussian elimination proceeds, the trailing submatrix A(end+1:n,end+1:n) that remains to be updated shrinks.
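
    A serial Python/NumPy sketch of this blocked algorithm (not a tuned implementation; the default block size and function names are my choices, and scipy's solve_triangular stands in for the LL^{-1} triangular solve):

    import numpy as np
    from scipy.linalg import solve_triangular

    def gepp_blocked(A, b=32, eps=1e-12):
        A = A.copy()
        n = A.shape[0]
        p = np.arange(n)
        for ib in range(0, n - 1, b):
            e = min(ib + b, n)      # one past the last column of the block
            # Panel factorization of A[ib:n, ib:e] with BLAS2 updates:
            for i in range(ib, e):
                m = i + np.argmax(np.abs(A[i:, i]))
                if abs(A[m, i]) < eps:
                    raise ValueError("matrix is numerically singular")
                if m != i:
                    A[[i, m], :] = A[[m, i], :]   # swap full rows
                    p[[i, m]] = p[[m, i]]
                A[i+1:, i] /= A[i, i]
                A[i+1:, i+1:e] -= np.outer(A[i+1:, i], A[i, i+1:e])
            if e < n:
                # X = LL^{-1} * A[ib:e, e:]: BLAS3 triangular solve
                A[ib:e, e:] = solve_triangular(A[ib:e, ib:e], A[ib:e, e:],
                                               lower=True, unit_diagonal=True)
                # Delayed rank-b update: BLAS3 matrix-matrix multiply
                A[e:, e:] -= A[e:, ib:e] @ A[ib:e, e:]
        return p, A

    The output format matches the unblocked gepp above, so the same P*A = L*U check applies.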

    Column Blocked Layout

  • poor load balance: as the elimination proceeds, the processors owning the leftmost columns fall idle

    Column Cyclic Layout

  • good load balance
  • cannot use BLAS2 to factorize the panel A(ib:n,ib:end)
  • cannot use BLAS3 to update A(end+1:n,end+1:n)

    Column Block Cyclic Layout

  • slightly worse load balance
  • can use BLAS2 and BLAS3
  • factorization of A(ib:n,ib:end) takes place on one processor, a serial bottleneck

    Row and Column (2D) Block Cyclic Layout

    Notation

    (pi,pj): index of a processor, 0 <= pi < prow, 0 <= pj < pcol

    p = prow*pcol: number of processors

    brow: number of rows in a block

    bcol: number of columns in a block

    (pi,:): processor row pi (the pcol processors sharing row index pi)

    (:,pj): processor column pj (the prow processors sharing column index pj)

    Mapping

    The matrix entry A(i,j) (indices counted from 0) is stored on processor (pi,pj), where pi = floor(i/brow) mod prow and pj = floor(j/bcol) mod pcol; equivalently, block (floor(i/brow), floor(j/bcol)) lives on that processor.

    Example. A matrix of order n=16 with 2x2 blocks (brow=bcol=2), hence an 8x8 grid of blocks, distributed over p=4 processors arranged as a 2x2 grid (prow=pcol=2).
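
    A small Python helper (names mine) makes the mapping concrete and reproduces the repeating ownership pattern of the example:

    def owner(i, j, brow, bcol, prow, pcol):
        # Processor (pi, pj) that stores entry A(i, j); 0-based indices.
        return ((i // brow) % prow, (j // bcol) % pcol)

    n, brow, bcol, prow, pcol = 16, 2, 2, 2, 2
    owners = [[owner(i, j, brow, bcol, prow, pcol)
               for j in range(0, n, bcol)]     # one entry per block column
              for i in range(0, n, brow)]      # one row per block row
    # owners[bi][bj] is the (pi, pj) owner of block (bi, bj); the
    # pattern repeats every prow x pcol blocks.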

    Distributed Gaussian Elimination with 2D Block Cyclic Layout

    Assume b = brow = bcol.

  • Step 1. Reduction among prow processors in the current column to find the pivot and broadcast its index.

  • Step 2. Exchange rows in the column and broadcast the pivot row.

  • Step 3. Compute multipliers (local) using BLAS1.

  • Step 4. Update A(ib:n, ib:end) (local) using BLAS2.

  • Step 5. Broadcast pivot information to all other processors in the row.

  • Step 6. Swap rows.

  • Step 7. Broadcast LL to the right along the processor row.

  • Step 8. Solve for X in LL*X = A(ib:end, end+1:n) (a set of lower triangular systems). The solution X overwrites A(ib:end, end+1:n). (local)

  • Step 9. Send blocks down for updating. Assuming tree-based broadcast, the communication time is
    	log2(prow)*(alpha + (b*(n-end)/pcol)*beta)
    
  • Step 10. Send blocks to the right for updating. Assuming ring-based broadcast, the communication time is
    	2*(alpha + (b*(n-end)/prow)*beta)
    
    Note that we only count the left-most processor (one send and one receive); the communication of the others can be overlapped.

  • Step 11. Update A(end+1:n,end+1:n) (local) using BLAS3. The computation time is
    	2*b*(n-end)^2/p
    
    Steps 9 and 10 dominate the communication time and step 11 dominates the computation time. Summed over the n/b block steps, steps 9-11 cost approximately
    	  alpha*(n/b)*(log2(prow) + 2)
    	+ beta*(log2(prow)/(2*pcol) + 1/prow)*n^2
    	+ (2/3)*n^3/p
    
    Accounting as well for the remaining per-column communication (pivot search, swaps, and broadcasts), the total time is approximately
    	  alpha*n*(6 + log2(prow))
    	+ beta*n^2*(2*(prow - 1)/p + (pcol - 1)/p + log2(prow)/(2*pcol))
    	+ (2/3)*n^3/p
    
    The efficiency, i.e. the serial time (2/3)*n^3 divided by p times the total time above, is the reciprocal of
    	  1 + alpha*(1.5*p*(6 + log2(prow))/n^2)
    	+ beta*(1.5/n)*(2*prow + pcol + prow*log2(prow)/2 - 3)
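
    To get a feel for the model, here is a small Python evaluation of the efficiency expression (the alpha and beta values below are purely illustrative; both are assumed to be normalized by the time per flop):

    from math import log2

    def efficiency(n, p, prow, pcol, alpha, beta):
        # Inverse of the reciprocal-efficiency expression above.
        recip = (1.0
                 + alpha * 1.5 * p * (6 + log2(prow)) / n**2
                 + beta * (1.5 / n)
                        * (2*prow + pcol + prow*log2(prow)/2 - 3))
        return 1.0 / recip

    # For fixed p (a 4x4 grid, p = 16), efficiency improves as n grows:
    for n in (1000, 4000, 16000):
        print(n, efficiency(n, p=16, prow=4, pcol=4, alpha=1e4, beta=1e2))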
    
    Remarks

  • Efficiency increases as alpha and beta (the communication costs) decrease. When alpha = beta = 0 the efficiency is 100%, meaning the work is perfectly load balanced.
  • For fixed p, alpha, and beta, efficiency increases as n increases.
  • Efficiency is maintained if n grows at the same rate as sqrt(p), assuming prow ~ pcol ~ sqrt(p).
  • To maximize efficiency, prow should be slightly smaller than sqrt(p): in the beta term, prow carries a larger coefficient than pcol, so a slightly flatter processor grid reduces communication.