The idea is to let each thread compute some columns of the factor L. Comparing the jth column in the equation A = LL^T,
A(:,j) = SUM_{k=1}^{j} L(:,k)*L(j,k)Define
s(j:n) = A(j:n,j) - SUM_{k=1}^{j-1} L(j:n,k)*L(j,k)we obtain the key result
L(j:n,j)*L(j,j) = s(j:n)and
L(j:n,j) = s(j:n)/sqrt(s(j))Thus as soon as the kth column L(:,k) is available, we can update
s(j:n) <- s(j:n) - L(j:n,k)*L(j,k)for all j>k where s(j:n) is initialized as A(j:n,j). The above one step of update requires 2(n-j) flops. Since this is a decreasing function of j, work load is imbalanced if each thread handles a block of consecutive columns. So, we assign columns t:p:n to thread t. For the case n=22 and p=4, the column assignment is
Thread Columns --------------------------------- 1 1 5 9 13 17 21 2 2 6 10 14 18 22 3 3 7 11 15 19 4 4 8 12 16 20Assuming the A is in shared memory and the computed L overwrites the lower part of A, here is the thread function for thread(t):
myCols = t:p:n for k = 1:n if any(myCols == k) % generate an L-column A(k:n,k) = A(k:n,k)/sqrt(A(k,k); end barrier % update my columns whose indices are greater than k for j = k+1:n if any(myCols == j) A(j:n,j) = A(j:n,j) - A(j:n,k)*A(j,k); end end barrier endWhy do we need two barriers?
Measure the time and compare it with the sequential ones.