The major goal of decomposition is to expose enough concurrency to keep processes busy, but not so much that the overhead of managing the tasks becomes substantial compared to the useful work done. Because of Amdahl's Law, identify and decompose the major computations first: whatever remains sequential limits the achievable speedup.
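As a reminder of Amdahl's Law: if a fraction s of the work is inherently sequential, the speedup on p processes is bounded by 1/(s + (1-s)/p). For example, with s = 0.05 the speedup can never exceed 20, no matter how many processes are used.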
The primary goal of assignment is to balance the workload among processes while keeping communication and assignment overhead low. Assignment may be static (fixed before execution) or dynamic (tasks handed out at run time); see the sketch below.
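A minimal sketch of the two flavors, expressed with OpenMP loop scheduling (OpenMP and the work() function are assumptions for illustration, not part of the lecture's pseudocode):

/* compile with:  cc -fopenmp -O2 assign.c */
#include <stdio.h>
#include <stdlib.h>

/* hypothetical task whose cost varies with i, so load balance matters */
static double work(int i) {
    double s = 0.0;
    for (int k = 0; k < i % 1000; k++) s += 1.0 / (k + 1);
    return s;
}

int main(void) {
    enum { N = 10000 };
    double *result = malloc(N * sizeof *result);

    /* static assignment: iterations divided into fixed contiguous blocks up front */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N; i++) result[i] = work(i);

    /* dynamic assignment: chunks of 16 iterations handed out at run time,
       so threads that finish early pick up more work */
    #pragma omp parallel for schedule(dynamic, 16)
    for (int i = 0; i < N; i++) result[i] = work(i);

    printf("%f\n", result[0]);
    free(result);
    return 0;
}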
The goals of orchestration are to execute the tasks assigned to
processes, to access data, to exchange data (communication), and to
synchronize processes.
Issues involved in orchestration include how data are named and accessed, how communication is structured, how processes synchronize, and how tasks are scheduled.
The performance goals are to reduce the cost of communication and synchronization, preserve locality of data reference, schedule tasks so that dependences are satisfied early, and keep the overhead of managing parallelism low.
There are two basic kinds of mapping: space-sharing and time-sharing. In space-sharing, the machine is divided into subsets of processors and each subset runs only a single program at a time; in time-sharing, programs take turns on the same processors. A mix of space-sharing and time-sharing is typical in practice.
Running example: a simple equation solver that sweeps over an (n+2)-by-(n+2) grid, where the outermost rows and columns are fixed boundary values and only the interior n-by-n points are updated.
procedure Solve(float **A)
{
    int i, j, done = 0;
    float diff = 0, temp;
    while (!done) {
        diff = 0;                                     % reset
        for i = 1:n {                                 % sweep through grid
            for j = 1:n {
                temp = A(i,j);                        % save A(i,j)
                A(i,j) = 0.2*(A(i,j) + A(i,j-1) + A(i-1,j)
                              + A(i,j+1) + A(i+1,j)); % update A(i,j)
                diff += abs(A(i,j) - temp);           % measure error
            }                                         % for j
        }                                             % for i
        if (diff/(n*n) < TOL)                         % converged?
            done = 1;
    }                                                 % while
}
Decomposition
Program-structure-based approach: parallelize within a loop first, then parallelize across loops. Loops usually contribute the bulk of the computation (again, Amdahl's Law).
The inner loops (for i and for j) are sequentially dependent: each update of A(i,j) reads the already-updated values A(i,j-1) and A(i-1,j) from the same sweep.
The outer while loop is also sequentially dependent: each sweep starts from the grid produced by the previous sweep.
Let's go back to the fundamental data dependences rather than the program structure. Elements along an anti-diagonal (i + j constant) are independent of one another; dependences run only from one anti-diagonal to the next, as in the wavefront sketch below. Exploiting this directly yields limited and uneven parallelism, so the for_all version that follows simply ignores the dependences within a sweep, relying on the fact that the iterative method still converges with such asynchronous updates.
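A minimal sketch of the wavefront idea in C (OpenMP for the parallel inner loop is an assumption, not part of the lecture's pseudocode; the grid is a dense (n+2)-by-(n+2) array with fixed boundaries):

/* compile with:  cc -fopenmp -O2 wavefront.c */
#include <math.h>

float sweep_wavefront(int n, float **A)
{
    float diff = 0.0f;
    for (int d = 2; d <= 2 * n; d++) {         /* anti-diagonal: i + j == d */
        int ilo = (d - n > 1) ? d - n : 1;
        int ihi = (d - 1 < n) ? d - 1 : n;
        #pragma omp parallel for reduction(+:diff)
        for (int i = ilo; i <= ihi; i++) {     /* points on one anti-diagonal are independent */
            int j = d - i;
            float temp = A[i][j];
            A[i][j] = 0.2f * (A[i][j] + A[i][j-1] + A[i-1][j]
                              + A[i][j+1] + A[i+1][j]);
            diff += fabsf(A[i][j] - temp);
        }
    }
    return diff;                               /* caller tests diff/(n*n) < TOL */
}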
while (!done) {
    diff = 0.0;
    for_all i = 1:n {
        for_all j = 1:n {
            temp = A(i,j);
            A(i,j) = 0.2*( ... );
            diff += abs(A(i,j) - temp);
        }
    }
    if (diff/(n*n) < TOL)
        done = 1;
}
Assignment
A static assignment is natural here: partition the grid into blocks of contiguous rows and give each of the nprocs processes one block of n/nprocs rows. This is the partition used in all three versions below.
Orchestration
Orchestration is architecture (programming model) dependent. The same decomposition and assignment are expressed below in three models: data parallel, shared address space, and message passing.
1. Data parallel model
procedure Solve(float **A)
{
    int i, j, done = 0;
    float mydiff = 0.0, temp;
    DECOMP A[BLOCK, *, nprocs];         % partition A into block rows among
                                        % processes; columns are not partitioned
    while (!done) {
        mydiff = 0.0;
        for_all i = 1:n {
            for_all j = 1:n {
                temp = A(i,j);
                A(i,j) = 0.2*( ... );
                mydiff += abs(A(i,j) - temp);
            }
        }
        REDUCE(mydiff, diff, ADD);      % global sum of per-process errors into diff
        if (diff/(n*n) < TOL)
            done = 1;
    }
}
2. Shared address space model
main()                                  % main thread
{
    initialize(A);
    CREATE(nprocs-1, Solve, A);         % create threads running Solve
    Solve(A);                           % main thread calls Solve
    WAIT_FOR_END(nprocs-1);             % wait for threads to finish
}

procedure Solve(float **A)
{
    int i, j, myid = getid(), done = 0;
    float mydiff = 0.0, temp;
    int mymin = 1 + (myid*n/nprocs);    % first row of this thread's block
    int mymax = mymin + n/nprocs - 1;   % last row of this thread's block
    while (!done) {
        mydiff = diff = 0.0;
        BARRIER(bar1, nprocs);          % global synch
        for i = mymin:mymax {
            for j = 1:n {
                temp = A(i,j);
                A(i,j) = 0.2*( ... );
                mydiff += abs(A(i,j) - temp);
            }
        }
        LOCK(diff_lock);                % access global diff
        diff += mydiff;
        UNLOCK(diff_lock);
        BARRIER(bar1, nprocs);          % wait until all threads finish sweep
        if (diff/(n*n) < TOL)
            done = 1;
        BARRIER(bar1, nprocs);          % synch before starting next sweep
    }
}
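The LOCK/UNLOCK and BARRIER primitives map directly onto real threading APIs. A minimal sketch assuming POSIX threads (the pthread names are an assumption for illustration, not the lecture's notation):

#include <pthread.h>

static pthread_mutex_t diff_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_barrier_t bar1;   /* initialized once via pthread_barrier_init(&bar1, NULL, nprocs) */
static float diff;

/* accumulate a thread's local error into the global diff, then wait for all threads */
void accumulate_and_wait(float mydiff)
{
    pthread_mutex_lock(&diff_lock);
    diff += mydiff;
    pthread_mutex_unlock(&diff_lock);
    pthread_barrier_wait(&bar1);
}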
3. Message passing model
Each process allocates a local array of size (n/nprocs+2)-by-(n+2). The two extra (ghost) rows hold copies of the neighboring processes' boundary rows, which simplifies the inner-loop code.
float **myblock;

main()
{
    CREATE(nprocs-1, Solve);            % create processes to run Solve
    Solve();
    WAIT_FOR_END(nprocs-1);
}

procedure Solve()
{
    int i, j, myid = getid(), size = n/nprocs, done = 0;
    float temp, tempdiff, mydiff = 0.0;
    initialize(myblock);                % may be received from host
    while (!done) {
        mydiff = 0.0;                   % reset local error for this sweep
        % exchange boundary rows with neighbors into the ghost rows
        if (myid != 0)
            SEND(&myblock(1,1), n*sizeof(float), myid-1, ROW);
        if (myid != nprocs-1)
            SEND(&myblock(size,1), n*sizeof(float), myid+1, ROW);
        if (myid != 0)
            RECEIVE(&myblock(0,1), n*sizeof(float), myid-1, ROW);
        if (myid != nprocs-1)
            RECEIVE(&myblock(size+1,1), n*sizeof(float), myid+1, ROW);
        for i = 1:size {
            for j = 1:n {
                temp = myblock(i,j);
                myblock(i,j) = 0.2*( ... );
                mydiff += abs(myblock(i,j) - temp);
            }
        }
        if (myid != 0) {                % send local error to process 0; get done flag back
            SEND(mydiff, sizeof(float), 0, DIFF);
            RECEIVE(done, sizeof(int), 0, DONE);
        } else {                        % process 0 accumulates errors and tests convergence
            for i = 1:nprocs-1 {
                RECEIVE(tempdiff, sizeof(float), any, DIFF);
                mydiff += tempdiff;
            }
            if (mydiff/(n*n) < TOL)
                done = 1;
            for i = 1:nprocs-1
                SEND(done, sizeof(int), i, DONE);
        }
    }
}
Deadlock can occur if synchronous send and receive are used: every process issues its SENDs first and blocks waiting for a matching RECEIVE that its neighbor, also blocked in a SEND, never posts. Blocking asynchronous (buffered) or nonblocking asynchronous sends and receives avoid this, as does pairing each send with its matching receive so that both are always posted together (see the sketch below).
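A minimal sketch of a deadlock-free boundary-row exchange using MPI's MPI_Sendrecv, which pairs the send and receive internally so no ordering deadlock can occur even over synchronous channels (MPI and the flat myblock layout are assumptions for illustration; the lecture's SEND/RECEIVE primitives are not MPI):

#include <mpi.h>

/* row-major index into a (size+2) x (n+2) local block */
#define IDX(i, j, n) ((i) * ((n) + 2) + (j))

void exchange_ghost_rows(float *myblock, int size, int n, int myid, int nprocs)
{
    int up   = (myid == 0)          ? MPI_PROC_NULL : myid - 1;  /* no-op at the edges */
    int down = (myid == nprocs - 1) ? MPI_PROC_NULL : myid + 1;

    /* send my first interior row up; receive upper neighbor's last row into ghost row 0 */
    MPI_Sendrecv(&myblock[IDX(1, 1, n)], n, MPI_FLOAT, up, 0,
                 &myblock[IDX(0, 1, n)], n, MPI_FLOAT, up, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* send my last interior row down; receive lower neighbor's first row into ghost row size+1 */
    MPI_Sendrecv(&myblock[IDX(size, 1, n)],     n, MPI_FLOAT, down, 0,
                 &myblock[IDX(size + 1, 1, n)], n, MPI_FLOAT, down, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}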