The idea is to let each thread compute some columns of the factor L. Comparing the jth column in the equation A = LL^T,
A(:,j) = SUM_{k=1}^{j} L(:,k)*L(j,k)
Define
s(j:n) = A(j:n,j) - SUM_{k=1}^{j-1} L(j:n,k)*L(j,k)
we obtain the key result
L(j:n,j)*L(j,j) = s(j:n)
and
L(j:n,j) = s(j:n)/sqrt(s(j))
Thus as soon as the kth column L(:,k) is available, we can update
s(j:n) <- s(j:n) - L(j:n,k)*L(j,k)
for all j>k where s(j:n) is initialized as A(j:n,j). The above one
step of update requires 2(n-j) flops. Since this is a decreasing
function of j, work load is imbalanced if each thread handles
a block of consecutive columns. So, we assign columns t:p:n to
thread t. For the case n=22 and p=4, the column assignment is
Thread Columns
---------------------------------
1 1 5 9 13 17 21
2 2 6 10 14 18 22
3 3 7 11 15 19
4 4 8 12 16 20
Assuming the A is in shared memory and the computed L overwrites the
lower part of A, here is the thread function for thread(t):
myCols = t:p:n
for k = 1:n
if any(myCols == k)
% generate an L-column
A(k:n,k) = A(k:n,k)/sqrt(A(k,k);
end
barrier
% update my columns whose indices are greater than k
for j = k+1:n
if any(myCols == j)
A(j:n,j) = A(j:n,j) - A(j:n,k)*A(j,k);
end
end
barrier
end
Why do we need two barriers? Measure the time and compare it with the sequential ones.