Assignment 2: Due Thur., November 9, 5 pm


In this assignment, we implement the Cholesky factorization in a multithread environment. Assume that the n-by-n SPD matrix A is stored in the shared memeory and that we have p threads, p < < n, in addition to the main thread.

The idea is to let each thread compute some columns of the factor L. Comparing the jth column in the equation A = LL^T,

        A(:,j) = SUM_{k=1}^{j} L(:,k)*L(j,k)
Define
        s(j:n) = A(j:n,j) - SUM_{k=1}^{j-1} L(j:n,k)*L(j,k)
we obtain the key result
        L(j:n,j)*L(j,j) = s(j:n)
and
        L(j:n,j) = s(j:n)/sqrt(s(j))
Thus as soon as the kth column L(:,k) is available, we can update
        s(j:n) <- s(j:n) - L(j:n,k)*L(j,k)
for all j>k where s(j:n) is initialized as A(j:n,j). The above one step of update requires 2(n-j) flops. Since this is a decreasing function of j, work load is imbalanced if each thread handles a block of consecutive columns. So, we assign columns t:p:n to thread t. For the case n=22 and p=4, the column assignment is
           Thread             Columns
           ---------------------------------
              1         1  5   9  13  17  21
              2         2  6  10  14  18  22
              3         3  7  11  15  19
              4         4  8  12  16  20
Assuming the A is in shared memory and the computed L overwrites the lower part of A, here is the thread function for thread(t):
    myCols = t:p:n
    for k = 1:n
       if any(myCols == k)
          % generate an L-column
          A(k:n,k) = A(k:n,k)/sqrt(A(k,k);
       end
       barrier
       % update my columns whose indices are greater than k
       for j = k+1:n
          if any(myCols == j)
             A(j:n,j) = A(j:n,j) - A(j:n,k)*A(j,k);
          end
       end
       barrier
    end
Why do we need two barriers?

Measure the time and compare it with the sequential ones.