Parallel Programming

Requirements
  • Balance work among processes
  • Reduce communication
  • Keep communication and synchronization overhead low

Steps
  1. Decomposition of computation into tasks
  2. Assignment of tasks to processes
  3. Orchestration of data access, communication, and synchronization
  4. Mapping of processes to processors

    The major goal of decomposition is to expose enough concurrency, but not so much that the overhead of managing the tasks becomes substantial compared to the useful work done. By Amdahl's Law, any computation left serial limits the achievable speedup, so identify the major computations and decompose those first.
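
    For reference, Amdahl's Law: if a fraction s of the total work is serial, the speedup on p processors is bounded by

        Speedup(p) = 1 / (s + (1-s)/p)  <=  1/s

    For example, with s = 0.1 the speedup can never exceed 10, however many processors are used.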

    The primary goal of assignment is to balance the workload among processes. Assignment can be static (fixed before execution) or dynamic (decided at run time as the computation proceeds).

    The goals of orchestration are to execute the tasks assigned to processes, to access data, to exchange data (communication), and to synchronize processes.
    Issues involved in orchestration:

  • How to organize data structures
  • How to schedule the tasks assigned to a process temporally to exploit data locality
  • Whether to communicate implicitly or explicitly and in small or large messages
  • How to organize and express interprocess communication and synchronization

    The performance goals of orchestration:

  • Reduce the cost of communication and synchronization as seen by the processors
  • Exploit locality
  • Schedule tasks to reduce idle time (waiting for intermediate results)
  • Reduce the overhead of parallelism management

    Basically, there are two kinds of mapping: space-sharing and time-sharing. In space-sharing, a single program runs at a time on a subset of the processors of a machine; in time-sharing, the processors are multiplexed among several programs over time. A mix of space-sharing and time-sharing is typical.

    Example. Iterative linear system solver

    An (n+2)-by-(n+2) grid including boundaries. The solver sweeps over the n-by-n interior, replacing each point with the average of itself and its four nearest neighbors, until the total change per sweep falls below a tolerance TOL.

      procedure Solve(float **A)
      {
        int   i,j,done=0;
        float diff=0, temp;
    
        while (!done) {
          diff = 0;               % reset
          for i=1:n {             % sweep through grid
            for j=1:n {
              temp = A(i,j);      % save A(i,j)
              A(i,j) = 0.2*(A(i,j) + A(i,j-1) + A(i-1,j) + A(i,j+1) + A(i+1,j));
                                  % update A(i,j)
              diff += abs(A(i,j) - temp);
                                  % measure error
            } % for j
          } % for i
          if (diff/(n*n) < TOL)  % converged?
            done = 1;
        } % while
      } % procedure

    Decomposition

    Program-structure-based approach: first parallelize within a loop, then parallelize across loops. Loops usually contribute the major computation (Amdahl's Law).

    The inner loops (for i and for j) are sequentially dependent: the update of A(i,j) uses the new values of A(i,j-1) and A(i-1,j) computed earlier in the same sweep.

    The outer loop (while) is also sequentially dependent: each sweep starts from the result of the previous one.

    Let's go back to the fundamental data dependences. Elements along an antidiagonal are mutually independent: within a sweep, each point depends only on its left and upper neighbors, which lie on earlier antidiagonals. There are several options:

  • Keep the loops as they are, but enforce the data-flow order with point-to-point synchronization: make sure a point waits until the points above it and to its left have been updated. The synchronization overhead is too high.
  • Change the loop structure to sweep along antidiagonals, with global synchronization (e.g., a barrier) between antidiagonals. The synchronization overhead is still too high.
  • Change the algorithm to red-black ordering: color the grid like a checkerboard and update all red points, then all black points. Each half-sweep is fully parallel, since points of one color depend only on points of the other color. Convergence? (See the sketch after the code below.)
  • Don't care about the ordering at all; let each update use whatever neighbor values it happens to see. Convergence?
  • Leave it to the hardware/software system using for_all:
        while (!done) {
          diff = 0.0;
          for_all i=1:n {
            for_all j=1:n {
              temp = A(i,j);
              A(i,j) = 0.2*( ... );
              diff += abs(A(i,j) - temp);
            }
          }
          if (diff/(n*n) < TOL) done = 1;
        }
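
    A minimal sketch of the red-black variant in C (an illustrative assumption, not from the original notes; N and TOL are placeholder constants):

      #include <math.h>

      #define N   64
      #define TOL 1e-4

      void solve_red_black(float A[N+2][N+2])
      {
          int done = 0;
          while (!done) {
              float diff = 0.0f;
              for (int color = 0; color < 2; color++) {  /* 0 = red, 1 = black */
                  /* all points of one color are independent, so this
                     half-sweep could safely be run in parallel */
                  for (int i = 1; i <= N; i++)
                      for (int j = 1; j <= N; j++) {
                          if ((i + j) % 2 != color) continue;
                          float temp = A[i][j];
                          A[i][j] = 0.2f * (A[i][j] + A[i][j-1] + A[i-1][j]
                                            + A[i][j+1] + A[i+1][j]);
                          diff += fabsf(A[i][j] - temp);
                      }
              }
              if (diff / ((float)N * N) < TOL)
                  done = 1;
          }
      }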
    

    Assignment

  • Block row: Each process is assigned a block of contiguous rows. Row i (i=1...n) is assigned to process
                     floor((i-1)/(n/p))
    assuming p divides n; see the sketch after this list.
  • Cyclic row: Row i is assigned to process
                     (i-1) mod p
    
  • Dynamic assignment: Each process grabs the next unprocessed row whenever it finishes a row.
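
    A minimal sketch in C of the two static owner computations (the function names are illustrative, not from the original):

      /* Which process owns row i (1..n), for p processes? */
      int block_owner(int i, int n, int p)   /* block row assignment  */
      {
          return (i - 1) / (n / p);          /* assumes p divides n   */
      }

      int cyclic_owner(int i, int p)         /* cyclic row assignment */
      {
          return (i - 1) % p;
      }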

    Orchestration

    Orchestration is architecture dependent: the same decomposition and assignment are expressed differently in the data parallel, shared address space, and message passing models.

    1. Data parallel model

      procedure Solve(float **A)
      {
        int   i,j,done=0;
        float mydiff=0.0, diff, temp;
    
        DECOMP A[BLOCK, *, nprocs];      % partition A into block rows among
                                         % processes, columns are not partitioned 
        while (!done) {
          mydiff = 0.0;
          for_all i=1:n {
            for_all j=1:n {
              temp = A(i,j);
              A(i,j) = 0.2*( ... );
              mydiff += abs(A(i,j) - temp);
            }
          }
    
          REDUCE(mydiff, diff, ADD);     % sum all processes' mydiff into diff
    
          if (diff/(n*n) < TOL) done = 1;
        }
      }
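
    For comparison, a minimal runnable analogue in C with OpenMP (my assumption for concreteness; the pseudocode above is model-neutral). The reduction clause plays the role of REDUCE, and the in-place updates correspond to the "don't care about the ordering" variant:

      #include <math.h>

      #define N   64
      #define TOL 1e-4

      void solve(float A[N+2][N+2])
      {
          int done = 0;
          while (!done) {
              float diff = 0.0f;
              /* rows are divided among threads; reduction(+:diff)
                 sums the per-thread error contributions */
              #pragma omp parallel for reduction(+:diff)
              for (int i = 1; i <= N; i++)
                  for (int j = 1; j <= N; j++) {
                      float temp = A[i][j];
                      A[i][j] = 0.2f * (A[i][j] + A[i][j-1] + A[i-1][j]
                                        + A[i][j+1] + A[i+1][j]);
                      diff += fabsf(A[i][j] - temp);
                  }
              if (diff / ((float)N * N) < TOL)
                  done = 1;
          }
      }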
    

    2. Shared address space model

    
      float **A, diff;                % shared among all threads
                                      % (diff_lock and bar1 are shared too)

      main()                          % main thread
      {
        initialize(A);
    
        CREATE(nprocs-1, Solve, A);   % create threads running Solve
    
        Solve(A);                     % main thread calls Solve
    
        WAIT_FOR_END(nprocs-1);       % wait for threads to finish
      }
    
    
      procedure Solve(float **A)
      {
        int   i,j,myid=getid(), done=0;
        float mydiff=0.0, temp;
        int   mymin = 1 + (myid*n/nprocs);   % first and last rows of my block
        int   mymax = mymin + n/nprocs - 1;  % (assumes nprocs divides n)
    
        while (!done) {
          mydiff = diff = 0.0;
    
          BARRIER(bar1, nprocs);       % global synch
    
          for i=mymin:mymax {
            for j=1:n {
              temp = A(i,j);
              A(i,j) = 0.2*( ... );
              mydiff += abs(A(i,j) - temp);
            }
          }
    
          LOCK(diff_lock);             % access global diff
          diff += mydiff;
          UNLOCK(diff_lock);
    
          BARRIER(bar1, nprocs);       % wait until all threads finish sweep
    
          if (diff/(n*n) < TOL) done = 1;
    
          BARRIER(bar1, nprocs);       % synch before starting next sweep
        } 
      }
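
    A minimal runnable analogue in C with POSIX threads (an assumption for concreteness; CREATE, BARRIER, and LOCK above are model pseudocode). A pthread barrier and mutex play the roles of bar1 and diff_lock:

      #include <math.h>
      #include <pthread.h>

      #define N      64
      #define NPROCS 4
      #define TOL    1e-4

      float A[N+2][N+2];                    /* shared grid               */
      float diff;                           /* shared global error       */
      pthread_mutex_t   diff_lock = PTHREAD_MUTEX_INITIALIZER;
      pthread_barrier_t bar1;               /* initialized in main       */

      void *solve(void *arg)
      {
          int myid  = (int)(long)arg;
          int mymin = 1 + myid * (N / NPROCS);
          int mymax = mymin + N / NPROCS - 1;
          int done  = 0;

          while (!done) {
              float mydiff = 0.0f;
              diff = 0.0f;                  /* every thread resets diff;     */
              pthread_barrier_wait(&bar1);  /* barrier before accumulating   */

              for (int i = mymin; i <= mymax; i++)
                  for (int j = 1; j <= N; j++) {
                      float temp = A[i][j];
                      A[i][j] = 0.2f * (A[i][j] + A[i][j-1] + A[i-1][j]
                                        + A[i][j+1] + A[i+1][j]);
                      mydiff += fabsf(A[i][j] - temp);
                  }

              pthread_mutex_lock(&diff_lock);
              diff += mydiff;               /* accumulate into global diff   */
              pthread_mutex_unlock(&diff_lock);

              pthread_barrier_wait(&bar1);  /* wait until all sweeps finish  */
              if (diff / ((float)N * N) < TOL)
                  done = 1;
              pthread_barrier_wait(&bar1);  /* synch before the next reset   */
          }
          return NULL;
      }

      int main(void)
      {
          pthread_t tid[NPROCS - 1];
          pthread_barrier_init(&bar1, NULL, NPROCS);
          /* initialize(A) would go here */
          for (long t = 1; t < NPROCS; t++)
              pthread_create(&tid[t-1], NULL, solve, (void *)t);
          solve((void *)0);                 /* main thread participates too */
          for (int t = 0; t < NPROCS - 1; t++)
              pthread_join(tid[t], NULL);
          return 0;
      }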
    
    

    3. Message passing model

    Each process allocates a local array myblock of size (n/nprocs+2)-by-(n+2). The two extra "ghost" rows hold copies of the neighboring processes' boundary rows and simplify the programming.

    
      float **myblock;
    
      main()
      {
        CREATE(nprocs-1, Solve);        % create processes to run Solve
        Solve();
        WAIT_FOR_END(nprocs-1); 
      }
    
    
      procedure Solve()
      {
        int   i,j,myid=getid(), size=n/nprocs, done=0;
        float temp, tempdiff, mydiff=0.0;
    
        initialize(myblock);            % may be received from host
    
        while (!done) {
          mydiff = 0.0;               % reset local error each sweep

          if (myid != 0)              % send boundary rows to neighbors
            SEND(&myblock(1,1), n*sizeof(float), myid-1, ROW);
          if (myid != nprocs-1)
            SEND(&myblock(size,1), n*sizeof(float), myid+1, ROW);

          if (myid != 0)              % receive neighbors' boundary rows
            RECEIVE(&myblock(0,1), n*sizeof(float), myid-1, ROW);
          if (myid != nprocs-1)       % into the ghost rows
            RECEIVE(&myblock(size+1,1), n*sizeof(float), myid+1, ROW);
    
          for i=1:size {
            for j=1:n {
              temp = myblock(i,j);
              myblock(i,j) = 0.2*( ... );
              mydiff += abs(myblock(i,j) - temp);
            }
          }
    
          if (myid != 0) {            % send local error to process 0
            SEND(&mydiff, sizeof(float), 0, DIFF);
            RECEIVE(&done, sizeof(int), 0, DONE);
          } else {                    % process 0 accumulates and decides
            for i=1:nprocs-1 {
              RECEIVE(&tempdiff, sizeof(float), any, DIFF);
              mydiff += tempdiff;
            }
            if (mydiff/(n*n) < TOL) done = 1;
            for i=1:nprocs-1
              SEND(&done, sizeof(int), i, DONE);
          }
        }
      }
    
    

    Deadlock can occur if synchronous send and receive are used: every process first blocks in SEND waiting for a matching RECEIVE, so no process ever reaches its RECEIVE. Blocking asynchronous (buffered) sends and receives, or nonblocking asynchronous sends and receives, avoid this.
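
    For instance, in C with MPI (an assumption for concreteness; the SEND/RECEIVE above are model primitives), the ghost-row exchange can be written with MPI_Sendrecv, which posts the send and the receive together and therefore cannot deadlock; MPI_PROC_NULL turns the edge cases into no-ops:

      #include <mpi.h>

      #define ROW 0                   /* message tag for boundary rows */

      /* myblock is stored row-major with size+2 rows of n+2 floats;
         rows 0 and size+1 are the ghost rows. */
      void exchange_ghost_rows(float *myblock, int n, int size,
                               int myid, int nprocs)
      {
          int up   = (myid != 0)          ? myid - 1 : MPI_PROC_NULL;
          int down = (myid != nprocs - 1) ? myid + 1 : MPI_PROC_NULL;

          /* send first interior row up; receive ghost row from below */
          MPI_Sendrecv(&myblock[1*(n+2) + 1],        n, MPI_FLOAT, up,   ROW,
                       &myblock[(size+1)*(n+2) + 1], n, MPI_FLOAT, down, ROW,
                       MPI_COMM_WORLD, MPI_STATUS_IGNORE);

          /* send last interior row down; receive ghost row from above */
          MPI_Sendrecv(&myblock[size*(n+2) + 1],     n, MPI_FLOAT, down, ROW,
                       &myblock[0*(n+2) + 1],        n, MPI_FLOAT, up,   ROW,
                       MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      }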