Parallel Programming

Requirements
  • Balance work among processes
  • Reduce communication
  • Keep communication and synchronization overhead low

Steps
  1. Decomposition of computation into tasks
  2. Assignment of tasks to processes
  3. Orchestration of data access, communication, and synchronization
  4. Mapping of processes to processors

    The major goal of decomposition is to expose enough concurrency, but not so much that the overhead of managing the tasks becomes substantial compared to the useful work done. By Amdahl's Law, any computation left serial limits the achievable speedup, so identify the major computations and decompose those first.
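
    For reference, Amdahl's Law: if a fraction s of the total work is serial, the speedup on p processors is bounded by

        Speedup(p) = 1 / (s + (1-s)/p)  <=  1/s

    For example, with s = 0.1 the speedup can never exceed 10, however many processors are used.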

    The primary goal of assignment is to balance the workload among processes. Assignment can be static (fixed before execution) or dynamic (decided at run time as the computation proceeds).

    The goals of orchestration are to execute the tasks assigned to processes, to access data, to exchange data (communication), and to synchronize processes.
    Issues involved in orchestration:

  • How to organize data structures
  • How to schedule the tasks assigned to a process temporally to exploit data locality
  • Whether to communicate implicitly or explicitly and in small or large messages
  • How to organize and express interprocess communication and synchronization

    The performance goals of orchestration:

  • Reduce the cost of communication and synchronization as seen by the processors
  • Exploit locality
  • Schedule tasks to reduce idle time (waiting for intermediate results)
  • Reduce the overhead of parallelism management

    Basically, there are two kinds of mapping: space-sharing and time-sharing. In space-sharing, a single program runs at a time on a subset of the processors of a machine; in time-sharing, the processors are multiplexed among several programs over time. A mix of space-sharing and time-sharing is typical.

    Example. Iterative linear system solver

    An (n+2)-by-(n+2) grid including boundaries. The solver sweeps over the n-by-n interior, replacing each point with the average of itself and its four nearest neighbors, until the total change per sweep falls below a tolerance TOL.

      procedure Solve(float **A)
      {
        int   i,j,done=0;
        float diff=0, temp;
    
        while (!done) {
          diff = 0;               % reset
          for i=1:n {             % sweep through grid
            for j=1:n {
              temp = A(i,j);      % save A(i,j)
              A(i,j) = 0.2*(A(i,j) + A(i,j-1) + A(i-1,j) + A(i,j+1) + A(i+1,j));
                                  % update A(i,j)
              diff += abs(A(i,j) - temp);
                                  % measure error
            } % for j
          } % for i
          if (diff/(n*n) < TOL)  % converged?
            done = 1;
        } % while
      } % procedure

    Decomposition

    Program-structure-based approach: first parallelize within a loop, then parallelize across loops. Loops usually contribute the major computation (Amdahl's Law).

    The inner loops (for i and for j) are sequentially dependent: the update of A(i,j) uses the new values of A(i,j-1) and A(i-1,j) computed earlier in the same sweep.

    The outer loop (while) is also sequentially dependent: each sweep starts from the result of the previous one.

    Let's go back to the fundamental data dependences. Elements along an antidiagonal are mutually independent: within a sweep, each point depends only on its left and upper neighbors, which lie on earlier antidiagonals. There are several options:

  • Keep the loops as they are, but enforce the data-flow order with point-to-point synchronization: make sure a point waits until the points above it and to its left have been updated. The synchronization overhead is too high.
  • Change the loop structure to sweep along antidiagonals, with global synchronization (e.g., a barrier) between antidiagonals. The synchronization overhead is still too high.
  • Change the algorithm to red-black ordering: color the grid like a checkerboard and update all red points, then all black points. Each half-sweep is fully parallel, since points of one color depend only on points of the other color. Convergence? (See the sketch after the code below.)
  • Don't care about the ordering at all; let each update use whatever neighbor values it happens to see. Convergence?
  • Leave it to the hardware/software system using for_all:
        while (!done) {
          diff = 0.0;
          for_all i=1:n {
            for_all j=1:n {
              temp = A(i,j);
              A(i,j) = 0.2*( ... );
              diff += abs(A(i,j) - temp);
            }
          }
          if (diff/(n*n) < TOL) done = 1;
        }
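
    A minimal sketch of the red-black variant in C (an illustrative assumption, not from the original notes; N and TOL are placeholder constants):

      #include <math.h>

      #define N   64
      #define TOL 1e-4

      void solve_red_black(float A[N+2][N+2])
      {
          int done = 0;
          while (!done) {
              float diff = 0.0f;
              for (int color = 0; color < 2; color++) {  /* 0 = red, 1 = black */
                  /* all points of one color are independent, so this
                     half-sweep could safely be run in parallel */
                  for (int i = 1; i <= N; i++)
                      for (int j = 1; j <= N; j++) {
                          if ((i + j) % 2 != color) continue;
                          float temp = A[i][j];
                          A[i][j] = 0.2f * (A[i][j] + A[i][j-1] + A[i-1][j]
                                            + A[i][j+1] + A[i+1][j]);
                          diff += fabsf(A[i][j] - temp);
                      }
              }
              if (diff / ((float)N * N) < TOL)
                  done = 1;
          }
      }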
    

    Assignment

  • Block row: Each process is assigned a block of contiguous rows. Row i (i=1...n) is assigned to process
                     floor((i-1)/(n/p))
    assuming p divides n; see the sketch after this list.
  • Cyclic row: Row i is assigned to process
                     (i-1) mod p
    
  • Dynamic assignment: Each process grabs the next unprocessed row whenever it finishes a row.
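
    A minimal sketch in C of the two static owner computations (the function names are illustrative, not from the original):

      /* Which process owns row i (1..n), for p processes? */
      int block_owner(int i, int n, int p)   /* block row assignment  */
      {
          return (i - 1) / (n / p);          /* assumes p divides n   */
      }

      int cyclic_owner(int i, int p)         /* cyclic row assignment */
      {
          return (i - 1) % p;
      }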

    Orchestration

    Orchestration is architecture dependent: the same decomposition and assignment are expressed differently in the data parallel, shared address space, and message passing models.

    1. Data parallel model

      procedure Solve(float **A)
      {
        int   i,j,done=0;
        float mydiff=0.0, diff, temp;
    
        DECOMP A[BLOCK, *, nprocs];      % partition A into block rows among
                                         % processes, columns are not partitioned 
        while (!done) {
          mydiff = 0.0;
          for_all i=1:n {
            for_all j=1:n {
              temp = A(i,j);
              A(i,j) = 0.2*( ... );
              mydiff += abs(A(i,j) - temp);
            }
          }
    
          REDUCE(mydiff, diff, ADD);     % sum all processes' mydiff into diff
    
          if (diff/(n*n) < TOL) done = 1;
        }
      }
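
    For comparison, a minimal runnable analogue in C with OpenMP (my assumption for concreteness; the pseudocode above is model-neutral). The reduction clause plays the role of REDUCE, and the in-place updates correspond to the "don't care about the ordering" variant:

      #include <math.h>

      #define N   64
      #define TOL 1e-4

      void solve(float A[N+2][N+2])
      {
          int done = 0;
          while (!done) {
              float diff = 0.0f;
              /* rows are divided among threads; reduction(+:diff)
                 sums the per-thread error contributions */
              #pragma omp parallel for reduction(+:diff)
              for (int i = 1; i <= N; i++)
                  for (int j = 1; j <= N; j++) {
                      float temp = A[i][j];
                      A[i][j] = 0.2f * (A[i][j] + A[i][j-1] + A[i-1][j]
                                        + A[i][j+1] + A[i+1][j]);
                      diff += fabsf(A[i][j] - temp);
                  }
              if (diff / ((float)N * N) < TOL)
                  done = 1;
          }
      }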
    

    2. Shared address space model

    
      float **A, diff;                % shared among all threads
                                      % (diff_lock and bar1 are shared too)

      main()                          % main thread
      {
        initialize(A);
    
        CREATE(nprocs-1, Solve, A);   % create threads running Solve
    
        Solve(A);                     % main thread calls Solve
    
        WAIT_FOR_END(nprocs-1);       % wait for threads to finish
      }
    
    
      procedure Solve(float **A)
      {
        int   i,j,myid=getid(), done=0;
        float mydiff=0.0, temp;
        int   mymin = 1 + (myid*n/nprocs);   % first and last rows of my block
        int   mymax = mymin + n/nprocs - 1;  % (assumes nprocs divides n)
    
        while (!done) {
          mydiff = diff = 0.0;
    
          BARRIER(bar1, nprocs);       % global synch
    
          for i=mymin:mymax {
            for j=1:n {
              temp = A(i,j);
              A(i,j) = 0.2*( ... );
              mydiff += abs(A(i,j) - temp);
            }
          }
    
          LOCK(diff_lock);             % access global diff
          diff += mydiff;
          UNLOCK(diff_lock);
    
          BARRIER(bar1, nprocs);       % wait until all threads finish sweep
    
          if (diff/(n*n) < TOL) done = 1;
    
          BARRIER(bar1, nprocs);       % synch before starting next sweep
        } 
      }
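
    A minimal runnable analogue in C with POSIX threads (an assumption for concreteness; CREATE, BARRIER, and LOCK above are model pseudocode). A pthread barrier and mutex play the roles of bar1 and diff_lock:

      #include <math.h>
      #include <pthread.h>

      #define N      64
      #define NPROCS 4
      #define TOL    1e-4

      float A[N+2][N+2];                    /* shared grid               */
      float diff;                           /* shared global error       */
      pthread_mutex_t   diff_lock = PTHREAD_MUTEX_INITIALIZER;
      pthread_barrier_t bar1;               /* initialized in main       */

      void *solve(void *arg)
      {
          int myid  = (int)(long)arg;
          int mymin = 1 + myid * (N / NPROCS);
          int mymax = mymin + N / NPROCS - 1;
          int done  = 0;

          while (!done) {
              float mydiff = 0.0f;
              diff = 0.0f;                  /* every thread resets diff;     */
              pthread_barrier_wait(&bar1);  /* barrier before accumulating   */

              for (int i = mymin; i <= mymax; i++)
                  for (int j = 1; j <= N; j++) {
                      float temp = A[i][j];
                      A[i][j] = 0.2f * (A[i][j] + A[i][j-1] + A[i-1][j]
                                        + A[i][j+1] + A[i+1][j]);
                      mydiff += fabsf(A[i][j] - temp);
                  }

              pthread_mutex_lock(&diff_lock);
              diff += mydiff;               /* accumulate into global diff   */
              pthread_mutex_unlock(&diff_lock);

              pthread_barrier_wait(&bar1);  /* wait until all sweeps finish  */
              if (diff / ((float)N * N) < TOL)
                  done = 1;
              pthread_barrier_wait(&bar1);  /* synch before the next reset   */
          }
          return NULL;
      }

      int main(void)
      {
          pthread_t tid[NPROCS - 1];
          pthread_barrier_init(&bar1, NULL, NPROCS);
          /* initialize(A) would go here */
          for (long t = 1; t < NPROCS; t++)
              pthread_create(&tid[t-1], NULL, solve, (void *)t);
          solve((void *)0);                 /* main thread participates too */
          for (int t = 0; t < NPROCS - 1; t++)
              pthread_join(tid[t], NULL);
          return 0;
      }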
    
    

    3. Message passing model

    Each process allocates a local array myblock of size (n/nprocs+2)-by-(n+2). The two extra "ghost" rows hold copies of the neighboring processes' boundary rows and simplify the programming.

    
      float **myblock;
    
      main()
      {
        CREATE(nprocs-1, Solve);        % create processes to run Solve
        Solve();
        WAIT_FOR_END(nprocs-1); 
      }
    
    
      procedure Solve()
      {
        int   i,j,myid=getid(), size=n/nprocs, done=0;
        float temp, tempdiff, mydiff=0.0;
    
        initialize(myblock);            % may be received from host
    
        while (!done) {
          mydiff = 0.0;               % reset local error each sweep

          if (myid != 0)              % send boundary rows to neighbors
            SEND(&myblock(1,1), n*sizeof(float), myid-1, ROW);
          if (myid != nprocs-1)
            SEND(&myblock(size,1), n*sizeof(float), myid+1, ROW);

          if (myid != 0)              % receive neighbors' boundary rows
            RECEIVE(&myblock(0,1), n*sizeof(float), myid-1, ROW);
          if (myid != nprocs-1)       % into the ghost rows
            RECEIVE(&myblock(size+1,1), n*sizeof(float), myid+1, ROW);
    
          for i=1:size {
            for j=1:n {
              temp = myblock(i,j);
              myblock(i,j) = 0.2*( ... );
              mydiff += abs(myblock(i,j) - temp);
            }
          }
    
          if (myid != 0) {            % send local error to process 0
            SEND(&mydiff, sizeof(float), 0, DIFF);
            RECEIVE(&done, sizeof(int), 0, DONE);
          } else {                    % process 0 accumulates and decides
            for i=1:nprocs-1 {
              RECEIVE(&tempdiff, sizeof(float), any, DIFF);
              mydiff += tempdiff;
            }
            if (mydiff/(n*n) < TOL) done = 1;
            for i=1:nprocs-1
              SEND(&done, sizeof(int), i, DONE);
          }
        }
      }
    
    

    Deadlock can occur if synchronous send and receive are used: every process first blocks in SEND waiting for a matching RECEIVE, so no process ever reaches its RECEIVE. Blocking asynchronous (buffered) sends and receives, or nonblocking asynchronous sends and receives, avoid this.
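
    For instance, in C with MPI (an assumption for concreteness; the SEND/RECEIVE above are model primitives), the ghost-row exchange can be written with MPI_Sendrecv, which posts the send and the receive together and therefore cannot deadlock; MPI_PROC_NULL turns the edge cases into no-ops:

      #include <mpi.h>

      #define ROW 0                   /* message tag for boundary rows */

      /* myblock is stored row-major with size+2 rows of n+2 floats;
         rows 0 and size+1 are the ghost rows. */
      void exchange_ghost_rows(float *myblock, int n, int size,
                               int myid, int nprocs)
      {
          int up   = (myid != 0)          ? myid - 1 : MPI_PROC_NULL;
          int down = (myid != nprocs - 1) ? myid + 1 : MPI_PROC_NULL;

          /* send first interior row up; receive ghost row from below */
          MPI_Sendrecv(&myblock[1*(n+2) + 1],        n, MPI_FLOAT, up,   ROW,
                       &myblock[(size+1)*(n+2) + 1], n, MPI_FLOAT, down, ROW,
                       MPI_COMM_WORLD, MPI_STATUS_IGNORE);

          /* send last interior row down; receive ghost row from above */
          MPI_Sendrecv(&myblock[size*(n+2) + 1],     n, MPI_FLOAT, down, ROW,
                       &myblock[0*(n+2) + 1],        n, MPI_FLOAT, up,   ROW,
                       MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      }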