Major issues in solving linear systems
Six most important problems of linear algebra
Selection of the algorithm depends on whether the system is sparse or dense.
Distinction among the six problems
for i = 1 to n-1
    for j = i+1 to n
        A(j,i) = A(j,i)/A(i,i)
    end;
    for j = i+1 to n
        for k = i+1 to n
            A(j,k) = A(j,k) - A(j,i)*A(i,k)
        end
    end
end

Vector version
for i = 1 to n-1
    A(i+1:n,i) = A(i+1:n,i)/A(i,i);                             % BLAS1
    A(i+1:n,i+1:n) = A(i+1:n,i+1:n) - A(i+1:n,i)*A(i,i+1:n)    % BLAS2 (rank-1 update)
end

Serial time: (2/3)*n^3 + O(n^2);
PRAM time: 3*(n-1).
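The vector version above can be sketched directly in NumPy; this is a minimal illustration (the function name `lu_nopivot` is ours, not from the notes), storing the multipliers in the strict lower triangle as the pseudocode does.

```python
import numpy as np

def lu_nopivot(A):
    """LU factorization without pivoting, following the loop above.
    Returns a matrix whose strict lower triangle holds L (unit diagonal
    implied) and whose upper triangle holds U."""
    A = A.astype(float).copy()
    n = A.shape[0]
    for i in range(n - 1):
        A[i+1:, i] = A[i+1:, i] / A[i, i]                   # BLAS1: scale column
        A[i+1:, i+1:] -= np.outer(A[i+1:, i], A[i, i+1:])   # BLAS2: rank-1 update
    return A

# Usage: recover L and U and check A = L*U.
A = np.array([[4., 3., 2.], [8., 7., 9.], [4., 6., 5.]])
F = lu_nopivot(A)
L = np.tril(F, -1) + np.eye(3)
U = np.triu(F)
assert np.allclose(L @ U, A)
```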
Example (partial pivoting)
for i = 1 to n-1
    find the index of max_{i<=j<=n} |A(j,i)|; record the index in p(i);
    if |A(p(i),i)| < epsilon, singular; exit end;
    if p(i) != i, swap A(i,:) and A(p(i),:) end;
    A(i+1:n,i) = A(i+1:n,i)/A(i,i);
    A(i+1:n,i+1:n) = A(i+1:n,i+1:n) - A(i+1:n,i)*A(i,i+1:n)
end

The above algorithm computes the decomposition P*A = L*U, where P is a permutation matrix, L is a lower triangular matrix with unit diagonal, and U is upper triangular.
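The GEPP loop can be sketched in NumPy as follows; the helper name `lu_partial_pivot` and the singularity tolerance `1e-12` (standing in for epsilon) are ours, and the array `perm` plays the role of p.

```python
import numpy as np

def lu_partial_pivot(A):
    """GEPP sketch: returns P, L, U with P @ A == L @ U."""
    A = A.astype(float).copy()
    n = A.shape[0]
    perm = np.arange(n)
    for i in range(n - 1):
        piv = i + np.argmax(np.abs(A[i:, i]))   # index of max |A(j,i)|, j >= i
        if abs(A[piv, i]) < 1e-12:
            raise ValueError("matrix is numerically singular")
        if piv != i:
            A[[i, piv], :] = A[[piv, i], :]     # swap rows i and piv
            perm[[i, piv]] = perm[[piv, i]]
        A[i+1:, i] /= A[i, i]
        A[i+1:, i+1:] -= np.outer(A[i+1:, i], A[i, i+1:])
    P = np.eye(n)[perm]                          # permutation matrix from perm
    L = np.tril(A, -1) + np.eye(n)
    U = np.triu(A)
    return P, L, U

# Usage: a matrix whose (1,1) entry is zero forces a row swap.
A = np.array([[0., 2., 1.], [1., 1., 1.], [2., 1., 3.]])
P, L, U = lu_partial_pivot(A)
assert np.allclose(P @ A, L @ U)
```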
Reorganize GEPP to use BLAS3. The idea is to delay the updates: save up a sequence of BLAS2 operations and apply them all at once using BLAS3. The block column width b (number of columns in a block) should be small enough that a block column fits in fast memory, yet large enough that matrix-matrix multiplication runs fast.
for ib = 1 to n-1 step b
    end = min(ib+b-1,n);    % locate a block column
    % LU factorize the panel A(ib:n,ib:end)
    for i = ib to end
        find the index of max_{i<=j<=n} |A(j,i)|; record the index in p(i);
        if singular, exit;
        swap A(p(i),:) and A(i,:);
        A(i+1:n,i) = A(i+1:n,i)/A(i,i);
        % only update inside the block, BLAS2
        A(i+1:n,i+1:end) = A(i+1:n,i+1:end) - A(i+1:n,i)*A(i,i+1:end);
    end
    % let LL = A(ib:end,ib:end), the unit lower triangular part
    A(ib:end,end+1:n) = LL^{-1}*A(ib:end,end+1:n);    % delayed update, BLAS3
    A(end+1:n,end+1:n) = A(end+1:n,end+1:n) - A(end+1:n,ib:end)*A(ib:end,end+1:n)
end

Observation: As Gaussian elimination proceeds, it updates successively smaller trailing blocks A(end+1:n, end+1:n).
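A minimal NumPy sketch of the blocked loop, assuming 0-based indexing with exclusive slice ends; the function name `blocked_gepp` is ours, and `np.linalg.solve` stands in for the unit-lower-triangular solve LL^{-1}*A(ib:end,end+1:n).

```python
import numpy as np

def blocked_gepp(A, b=2):
    """Blocked GEPP sketch: BLAS2 panel factorization, then two delayed
    BLAS3 updates (triangular solve and matrix-matrix multiply)."""
    A = A.astype(float).copy()
    n = A.shape[0]
    perm = np.arange(n)
    for ib in range(0, n - 1, b):
        end = min(ib + b, n)                      # one past the panel's last column
        # LU factorize the panel A[ib:, ib:end] with partial pivoting
        for i in range(ib, min(end, n - 1)):
            piv = i + np.argmax(np.abs(A[i:, i]))
            if piv != i:
                A[[i, piv], :] = A[[piv, i], :]   # swap full rows
                perm[[i, piv]] = perm[[piv, i]]
            A[i+1:, i] /= A[i, i]
            A[i+1:, i+1:end] -= np.outer(A[i+1:, i], A[i, i+1:end])  # BLAS2
        if end < n:
            # delayed updates, BLAS3
            LL = np.tril(A[ib:end, ib:end], -1) + np.eye(end - ib)
            A[ib:end, end:] = np.linalg.solve(LL, A[ib:end, end:])
            A[end:, end:] -= A[end:, ib:end] @ A[ib:end, end:]
    P = np.eye(n)[perm]
    L = np.tril(A, -1) + np.eye(n)
    U = np.triu(A)
    return P, L, U

# Usage: blocked and unblocked GEPP produce the same factorization.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
P, L, U = blocked_gepp(A, b=2)
assert np.allclose(P @ A, L @ U)
```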
(pi,pj): index of a processor, 0<= pi < prow, 0<= pj < pcol
p = prow*pcol: number of processors
brow: number of rows in a block
bcol: number of columns in a block
(pi,:): processor row pi
(:,pj): processor column pj
Mapping
A block A(i,j) is mapped to a processor (pi,pj) where pi = floor(i/brow) mod prow and pj = floor(j/bcol) mod pcol.
Example. A matrix of order n=16, block size 2x2 (brow=bcol=2), thus it is an 8x8 block matrix, using 4 processors with prow=pcol=2.
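The mapping can be sketched as follows, assuming 0-based indices (the helper name `owner` is ours); the printed pattern repeats every prow blocks down and every pcol blocks across.

```python
# Block-cyclic mapping sketch: the block containing element (i, j) goes
# to processor (floor(i/brow) mod prow, floor(j/bcol) mod pcol).
def owner(i, j, brow=2, bcol=2, prow=2, pcol=2):
    """Processor (pi, pj) owning the block that contains element (i, j)."""
    return ((i // brow) % prow, (j // bcol) % pcol)

# The n=16 example: 2x2 blocks on a 2x2 grid give an 8x8 block matrix.
n, brow, bcol = 16, 2, 2
for bi in range(n // brow):
    print(" ".join("(%d,%d)" % owner(bi * brow, bj * bcol)
                   for bj in range(n // bcol)))
```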
2*b*(n-end)^2/p

Steps 9 and 10 dominate the communication time, and step 11 dominates the computation time. The total time is approximately
alpha*(n/b)*(log2(prow) + 2) + beta*(log2(prow)/(2*pcol) + 1/prow)*n^2 + (2/3)*n^3/p
    = alpha*n*(6 + log2(prow))
      + beta*n^2*(2*(prow-1)/p + (pcol-1)/p + log2(prow)/(2*pcol))
      + (2/3)*n^3/p

The efficiency is the reciprocal of
1 + alpha*1.5*p*(6 + log2(prow))/n^2 + beta*(1.5/n)*(2*prow + pcol + prow*log2(prow)/2 - 3)

Remarks
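The efficiency expression is easy to evaluate numerically; this sketch uses assumed machine parameters (alpha and beta measured in units of the time per flop; the values 1000 and 10 below are illustrative, not from the notes).

```python
from math import log2

def efficiency(n, prow, pcol, alpha, beta):
    """Reciprocal of the overhead expression above, with alpha and beta
    normalized by the time per flop."""
    p = prow * pcol
    overhead = (1.0
                + alpha * 1.5 * p * (6 + log2(prow)) / n**2
                + beta * (1.5 / n) * (2 * prow + pcol
                                      + prow * log2(prow) / 2 - 3))
    return 1.0 / overhead

# Illustration: on a 4x4 grid, efficiency improves as n grows, since the
# communication terms decay like 1/n and 1/n^2.
print(efficiency(n=4096, prow=4, pcol=4, alpha=1000.0, beta=10.0))
print(efficiency(n=8192, prow=4, pcol=4, alpha=1000.0, beta=10.0))
```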