Performance

References

  • IPC: 4.1--4.7
  • Demmel: L3, L5
  • MPI: 12.1

Metrics

  • Run time
        T(1,n): serial run time
        T(p,n): parallel run time using p processors
    
    As the notation indicates, the serial run time is a function of the problem size n, while the parallel run time is a function of both the number of processors and the problem size; it also includes communication time.

  • Speedup
        speedup(p) = T(1,n) / T(p,n)
    
    In the speedup formula, T(1,n) is the run time of the best serial program, which may use a different algorithm from the parallel one. Usually, the speedup is less than p, since a single processor can always simulate p processors by executing their codes in sequence. However, superlinear speedup is possible. For example, the memory hierarchy can save time: when the data are distributed among processors, each processor's portion may fit in faster memory. Another example is the case when the program is nondeterministic. We will see an example in nonnumerical algorithms.

  • Efficiency
        E = speedup(p) / p
    
  • Cost: p*T(p,n). A program is cost optimal if (p*T(p,n)) / T(1,n) is bounded by a constant independent of both p and n. A small numeric sketch of these metrics follows.
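
    To make the definitions concrete, here is a small C sketch that computes speedup, efficiency, and cost from measured run times; the timings and processor counts are hypothetical, chosen only to illustrate the arithmetic.

      #include <stdio.h>

      int main(void) {
          /* Hypothetical measured run times (seconds) for a fixed problem size n. */
          double T1   = 100.0;                 /* T(1,n): best serial run time */
          double Tp[] = {52.0, 27.0, 15.0};    /* T(p,n) for p = 2, 4, 8       */
          int    pr[] = {2, 4, 8};

          for (int k = 0; k < 3; k++) {
              int p = pr[k];
              double speedup    = T1 / Tp[k];     /* speedup(p) = T(1,n)/T(p,n) */
              double efficiency = speedup / p;    /* E = speedup(p)/p           */
              double cost       = p * Tp[k];      /* cost = p*T(p,n)            */
              printf("p=%d  speedup=%.2f  efficiency=%.2f  cost=%.1f\n",
                     p, speedup, efficiency, cost);
          }
          return 0;
      }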

  • Amdahl's Law
        f: parallelizable fraction
        s: serial part (s + f = 1)
    
        speedup(p) = 1 / (s + f/p)
    
    Amdahl's law tells us that it is important to identify the serial bottleneck (the part where most of the time is spent); a short tabulation of the formula is given below. How do we identify the bottleneck? Profiling the program is very useful. The development of programs that achieve efficiency near 1 on 1024-processor machines shows that the serial fraction can be almost zero, which may seem surprising. In fact, s (or f) can be a function of the problem size n, and in many cases f tends to 1 as n grows. Thus we introduce
        scaled speedup = T(1, n(p)) / T(p, n(p))
    
        scaled efficiency = scaled speedup / p
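
    The sketch below (plain C, with assumed serial fractions) tabulates the Amdahl speedup bound for several processor counts; note that even s = 0.05 caps the speedup at 1/s = 20, no matter how many processors are used.

      #include <stdio.h>

      int main(void) {
          double s_values[] = {0.20, 0.05, 0.01};   /* assumed serial fractions */
          int    p_values[] = {4, 16, 64, 1024};

          for (int i = 0; i < 3; i++) {
              double s = s_values[i], f = 1.0 - s;
              printf("s = %.2f (limit 1/s = %.0f):", s, 1.0 / s);
              for (int j = 0; j < 4; j++) {
                  int p = p_values[j];
                  double speedup = 1.0 / (s + f / p);   /* Amdahl's law */
                  printf("  p=%d: %.1f", p, speedup);
              }
              printf("\n");
          }
          return 0;
      }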
    
  • Scalability

    As we have seen, efficiency usually increases as the problem size increases and decreases as the number of processors increases. Can we keep the efficiency fixed when the problem size and the number of processors increase simultaneously? This question leads to the notion of scalability. Recall that a program is cost-optimal if p*T(p,n)/T(1,n) is a constant. So, a scalable parallel system can always be made cost-optimal if the number of processors and the problem size are chosen appropriately.

    Isoefficiency function

  • Problem size (w): number of basic operations (floating-point additions and multiplications) in the best serial algorithm to solve the problem. For example, the problem size of matrix multiplication is O(n^3), where n is the order of the matrices.
  • Time unit: time for one basic operation. Thus the serial time T(1,n) = w.
  • Normally, the efficiency is less than 1. We define the overhead:
        T_o = p*T(p,n) - w > 0
    
    Sources of overhead:
  • communication
  • load imbalance (idle time)
  • redundant computation
  • From the definition of overhead, we get the parallel time
        T(p,n) = (T_o + w)/p
    
    and the efficiency
        E = w/(p*T(p,n)) = 1/(1 + T_o/w)
    
    The above equation shows that if the overhead grows faster than the problem size (w), then the efficiency (E) decreases. This is undesirable. To maintain the efficiency, we must have
        w/T_o = const
    
    In general, T_o is a function of w and p, since T(p,n) is a function of n (and thus of w) and p, and T_o = p*T(p,n) - w. A small numeric sketch of this relation follows.
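
    To illustrate the relation E = 1/(1 + T_o/w), the sketch below assumes a hypothetical overhead model T_o(w,p) = c*p*sqrt(w); the constant c and the base problem size are made-up values, not taken from these notes. It shows the efficiency dropping when w is held fixed as p grows, and staying level when w is scaled so that w/T_o stays constant.

      #include <stdio.h>
      #include <math.h>

      /* Hypothetical overhead model (an assumption for illustration): T_o(w,p) = c*p*sqrt(w). */
      static double overhead(double w, double p) {
          const double c = 100.0;                    /* assumed constant */
          return c * p * sqrt(w);
      }

      static double efficiency(double w, double p) {
          return 1.0 / (1.0 + overhead(w, p) / w);   /* E = 1/(1 + T_o/w) */
      }

      int main(void) {
          const double w0 = 1.0e6;   /* assumed problem size (basic operations) at p = 2 */
          for (double p = 2; p <= 1024; p *= 4) {
              /* For this model w/T_o = sqrt(w)/(c*p), so isoefficiency requires w ~ p^2. */
              double w_scaled = w0 * (p / 2.0) * (p / 2.0);
              printf("p=%4.0f  E(fixed w)=%.3f  E(w ~ p^2)=%.3f\n",
                     p, efficiency(w0, p), efficiency(w_scaled, p));
          }
          return 0;
      }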

    Example

    Consider the problem of adding n numbers on a p-processor hypercube. Adding n/p numbers locally requires (n/p - 1) operations (floating-point additions). Adding the p partial sums in parallel on a hypercube takes log(p) additions and log(p) nearest-neighbor communications, i.e., log(p) + log(p)*T_comm, where T_comm is the time of one communication step. Thus the total time is
        T(p,n) ~ n/p + log(p)*(1 + T_comm)
    
    Obviously, the serial time is w = n-1 ~ n. So, the overhead is
        T_o = p*T(p,n) - w = p*(1 + T_comm)*log(p)
    
    and the isoefficiency function is given by
        w/(p*(1 + T_comm)*log(p)) = const    or, absorbing the constant factor (1 + T_comm),    w/(p*log(p)) = const
    
    In this case, w ~ n, so the above equation means that if p increases to p', then to maintain the efficiency the problem size must increase from n to n*(p'*log(p'))/(p*log(p)). A minimal MPI sketch of the underlying computation follows.
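
    The sketch below is a minimal MPI version of this computation: each process adds its n/p numbers and MPI_Reduce combines the p partial sums (typical implementations use a tree or hypercube pattern, giving the log(p) term, though the exact algorithm is implementation dependent). The problem size n and the data values are assumptions for illustration.

      /* Sum n numbers on p processes: n/p local additions, then a log(p)-step reduction. */
      #include <mpi.h>
      #include <stdio.h>

      int main(int argc, char **argv) {
          MPI_Init(&argc, &argv);
          int p, myId;
          MPI_Comm_size(MPI_COMM_WORLD, &p);
          MPI_Comm_rank(MPI_COMM_WORLD, &myId);

          long long n = 1 << 20;              /* assumed total problem size           */
          long long chunk = n / p;            /* assume p divides n for simplicity    */

          /* Each process adds its n/p numbers (here simply 1.0 each). */
          double local = 0.0;
          for (long long i = 0; i < chunk; i++)
              local += 1.0;

          /* Combine the p partial sums; internally this is typically a log(p)-deep tree. */
          double global = 0.0;
          MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

          if (myId == 0)
              printf("sum = %.0f (expected %lld)\n", global, (long long)(chunk * p));

          MPI_Finalize();
          return 0;
      }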

    When T_o has multiple terms, we use the term that gives the highest growth rate. Suppose T_o = p^(3/2) + p^(3/4)*w^(3/4); then to keep w/T_o = const, we need

        w/T_o = 1/(p^(3/2)/w + p^(3/4)/w^(1/4)) = const

    If w grows at the rate of p^3 (w = O(p^3)), then p^(3/4)/w^(1/4) = const and

        1/(p^(3/2)/w + p^(3/4)/w^(1/4)) = 1/(O(p^(-3/2)) + O(1))

    which tends to a constant as p becomes very large. The parallel system is asymptotically isoefficient. A small numeric check follows.
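
    As a numeric check (a sketch; the particular values of p are arbitrary), the following computes E = 1/(1 + T_o/w) for T_o = p^(3/2) + p^(3/4)*w^(3/4) with w = p^3; the efficiency approaches a constant as p grows.

      #include <stdio.h>
      #include <math.h>

      int main(void) {
          for (double p = 4; p <= 1048576; p *= 16) {
              double w  = pow(p, 3.0);                               /* w grows as p^3    */
              double To = pow(p, 1.5) + pow(p, 0.75) * pow(w, 0.75); /* two-term overhead */
              double E  = 1.0 / (1.0 + To / w);                      /* E = 1/(1 + T_o/w) */
              printf("p=%10.0f  T_o/w=%.4f  E=%.4f\n", p, To / w, E);
          }
          return 0;
      }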

    Note that the isoefficiency condition implies that, to maintain the efficiency (to be isoefficient), the problem size must grow at least linearly in p. From const = w/T_o = w/(p*T(p,n) - w), it follows that const*p*T(p,n) = (1 + const)*w, or

        w = (const/(1 + const))*p*T(p,n)

    Since T(p,n) is at least one time unit, w is at least proportional to p.

    Sharks and fish problem

    This is a collection of parallel programming problems. Sharks and fish simulate moving particles in a 2D space following some physical rules. Details and some working implementations of these problems can be found in Demmel's CS267 course materials.

    The first sharks and fish problem is embarrassingly parallel.

    In the second sharks and fish problem, every fish needs the positions (and masses) of all the other fish to calculate the gravitational force acting on it.

    A sequential solution

      pos(1:n): positions, each pos(i) is a 2-D array or a complex number
      mass(1:n): masses
      dir(1:n): direction of forces
      force(1:n): forces
      vel(1:n): velocities
      accel(1:n): accelerations
      dt: time step
      tfinal: total time
    
      formulas:
            f = m1*m2/(r*r) (ignore the constant)
            v = a*t
            d = v*t + 0.5*a*t*t
    
      t = 0.0;
      while t < tfinal
          for i = 1:n
              force(i) = 0.0;
              for j = 1:n
                  if j ~= i
                      dir(j) = (pos(j) - pos(i))/||pos(j) - pos(i)||;
                      force(i) = force(i) + mass(i)*mass(j)*dir(j)/||pos(j) - pos(i)||^2;
                  end
              end
              accel(i) = force(i)/mass(i);
              pos(i) = pos(i) + dt*(vel(i) + 0.5*accel(i)*dt);
              vel(i) = vel(i) + dt*accel(i);
          end
          t = t + dt;
      end
    
    
  • Simple solution: Every fish broadcasts its position and receives the positions of all other fish. We could use the MPI_Allgather function (a hedged sketch is given at the end of this section).
  • Second solution: Suppose there are p processes and each process holds the data of N = n/p fish (positions, masses, forces, etc.). A second copy of the fish data (gpos(1:N), gmass(1:N)) is also distributed among the processes and rotated around a ring.
      t = 0.0;
      while t < tfinal
          force = 0.0;
          for i = 1:n-1
              send([gpos(N) gmass(N)]) to (myId+1) mod p;
              shift gpos and gmass to the right by one;
              recv([gpos(1) gmass(1)]) from (myId-1) mod p;
              dir = (gpos - pos)/||gpos - pos||;
              force = force + gmass*mass*dir/||gpos - pos||^2;
          end
          accel = force/mass;
          pos = pos + dt*(vel + 0.5*accel*dt);
          vel = vel + dt*accel;
          barrier
          t = t + dt;
      end
    
    
    Thus, there are (total_fish - 1) rotations (communications). The total communication time is

    T_comm = (total_fish - 1)*t_s + (total_fish - 1)*t_w
    
    assuming the information about one fish fits in one packet. Note that this program may deadlock if synchronous (blocking) sends and receives are used.
  • Third solution: The second copy is first rotated locally, so that the rotations within a block require no communication; then the processes circulate their blocks of the second copy. Thus there are (p - 1) communications (one per block transfer), and the total communication time is

    T_comm = (p - 1)*(t_s + (total_fish/p)*t_w)
    
    assuming the information about each fish takes one packet; usually, the fish in a block can be packed into a single message, so the startup cost t_s is paid only once per communication.
  • Since total_fish >> p (otherwise we would not use parallel computing) and t_s >> t_w, the third solution is better than the second. As we said earlier, we prefer sending fewer, larger messages.
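
    The following is a minimal MPI sketch of the simple (MPI_Allgather) solution for one force evaluation; the fish count, masses, and test positions are assumptions, and the gravitational constant is ignored as above. The second and third solutions would replace the allgather with n-1 (respectively p-1) send/receive steps around the ring.

      /* One force evaluation of the second sharks and fish problem using MPI_Allgather. */
      #include <mpi.h>
      #include <math.h>
      #include <stdio.h>
      #include <stdlib.h>

      int main(int argc, char **argv) {
          MPI_Init(&argc, &argv);
          int p, myId;
          MPI_Comm_size(MPI_COMM_WORLD, &p);
          MPI_Comm_rank(MPI_COMM_WORLD, &myId);

          int n = 1024;                 /* assumed total number of fish           */
          int N = n / p;                /* fish per process (assume p divides n)  */

          /* Local fish: x, y position and mass; filled with arbitrary test values. */
          double *pos  = malloc(2 * N * sizeof(double));
          double *mass = malloc(N * sizeof(double));
          for (int i = 0; i < N; i++) {
              pos[2*i]   = myId * N + i;      /* x (distinct for every fish) */
              pos[2*i+1] = 0.5 * i;           /* y                           */
              mass[i]    = 1.0;
          }

          /* Gather the positions and masses of all fish on every process. */
          double *gpos  = malloc(2 * n * sizeof(double));
          double *gmass = malloc(n * sizeof(double));
          MPI_Allgather(pos,  2 * N, MPI_DOUBLE, gpos,  2 * N, MPI_DOUBLE, MPI_COMM_WORLD);
          MPI_Allgather(mass, N,     MPI_DOUBLE, gmass, N,     MPI_DOUBLE, MPI_COMM_WORLD);

          /* Each process computes the forces on its own N fish (f = m1*m2/r^2). */
          double *force = calloc(2 * N, sizeof(double));
          for (int i = 0; i < N; i++) {
              int gi = myId * N + i;                  /* global index of local fish i */
              for (int j = 0; j < n; j++) {
                  if (j == gi) continue;              /* skip self-interaction        */
                  double dx = gpos[2*j]   - pos[2*i];
                  double dy = gpos[2*j+1] - pos[2*i+1];
                  double r2 = dx*dx + dy*dy;
                  double r  = sqrt(r2);
                  force[2*i]   += mass[i] * gmass[j] * (dx / r) / r2;
                  force[2*i+1] += mass[i] * gmass[j] * (dy / r) / r2;
              }
          }

          if (myId == 0)
              printf("force on fish 0: (%g, %g)\n", force[0], force[1]);

          free(pos); free(mass); free(gpos); free(gmass); free(force);
          MPI_Finalize();
          return 0;
      }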