Dynamic Programming

References

V. Kumar, A. Grama, A. Gupta, and G. Karypis, Introduction to Parallel Computing, Chapter 8, The Benjamin/Cumming Publishing Company, Inc., 1994.

Dynamic programming (DP) is a technique for solving discrete optimization problems. It solves the whole problem by solving subproblems. It starts with the lowest level, usually simple subproblems, then moves up levels and finally solves the problem. The solution to a subproblem is expressed as a function of solutions to one or more subproblems at the preceeding levels.

Example. The shortest-path problem.

Notations

nodes: 0, 1, ..., n-1

cost: c(i,j), i solution to subproblem: f(x) is the shortest path from 0 to x.

Thus f(n-1) is the solution to the whole problem.

Formulation

	        0				x=0
	f(x) =
	        min_{0<=j

Classification of DP

Monadic: Functional equation that contains a single recursive term.

Polyadic: Functional equation that contains multiple recursive
terms.

Serial: Subproblems at all levels depend on only the results
at the immediately proceeding level.


Example of serial monadic DP formulation. A shortest-path
problem where the nodes can be organized into levels.


Notations

C_l(i): minimum cost from v_l(i) to R

c_l: vector (c_l(0), c_l(1), ..., c_l(n-1))'

Thus the solution is c_0 = (c_0(0))

Formulation

	C_l(i) = min_j(c_l(i,j) + C_{l+1}(j)

Clearly,
	c_{r-1} = (c_{r-1}(0,R), c_{r-1}(1,R), ..., c_{r-1}(n-1,R))'

In general, write the above equation in matrix form
		c_l = M_{l,l+1} x c_{l+1}
where
	            c_l(0,0)    c_l(0,1)  ...  c_l(0,n-1)
		    c_l(1,0)    c_l(1,1)  ...  c_l(1,n-1)
	M(l,l+1) =     .           .               .
		       .           .               .
		    c_l(n-1,0)  c_l(n-1,1) ... c_l(n-1,n-1)

In M_{l,l+1) x c_{l+1), multiplication is replaced by + and
addition is replaced by min. Parallel matrix-vector multiplication
algorithms can be used.

Example of Nonserial monadic DP formulation.

The longest-common-subsequence prolbem.


Notations

F[i,j]: the longest common subsequence of the first
i elements of A and the first j elements of B,

Thus the solution is F[n,m].


Formulation

	          0				if i=0 or j=0
	F[i,j] =  F[i-1,j-1] + 1		if i,j>0 and xi=y_i
	          max{F[i,j-1], F[i-1,j]}	if i,j>0 and xi neq yj


Sequential Implementation

Matrix F:

	F[0,0]  F[0,1]  F[0,2]  ...  F[0,m]
	F[1,0]  F[1,1]  F[1,2]  ...  F[1,m]
	F[2,0]  F[2,1]  F[2,2]  ...  F[2,m]
          .        .       .            .
          .        .       .            .
	  .        .       .            .
	F[n,0]  F[n,1]  F[n,2]  ...  F[n,m]

Algorithm
	Initial conditions: F[0,:], F[:,0];
	for k=1 to 2*n-1
	    for each entry F[i,j] on the anti-diagonal (i+j=k)
		use F[i-1,j] (north), F[i,j-1] (west),
		and F[i-1,j-1] (north-west) to compute F[i,j]
	    end
	end


Serial time: O(mn)
PRAM time: O(n), assuming m=n, using n processors, 2n-1 steps


Column Layout

Use n processors, each has a column of F.

	total time = sum_{k=1}^{2*n-1} (alpha + beta -------- communication
		                        + 1 ----------------- computation
				       )
		   = (2*n-1)*(alpha + beta + 1)


                                    n^2
	Efficiency =  ------------------------------ <= 0.5
		       n*(2*n-1)*(alpha + beta + 1)


This upper bound (0.5) for the efficiency is due to the
poor load balance.

Example of Serial Polyadic DP Problem

Floy's All-Pairs Shortest-Paths Algorithm

Notations
G(V,E): graph

c(i,j): weight of the edge from i to j

d(i,j): cost of the shortest path between i and j

d_k(i,j): cost of the shortest path between i and j
using only nodes v_0, v_1, ..., v_{k-1}


Solution: d_n(i,j)


Formulation:

		    c(i,j)				              k=0
	d_k(i,j) =
		    min{d_{k-1}(i,j), (d_{k-1}(i,k) + d_{k-1}(k,j))}  0

Algorithm:
	for k=1 to n
	  for 0 <= i, j <= n-1
	    d_k(i,j) = min{d_{k-1}(i,j), (d_{k-1}(i,k) + d_{k-1}(k,j))}
	  end
	end


Serial time:
	sum_{k=1}^n sum_{i=0}^{n-1} sum_{j=0}^{n-1} (2) ------ min and +
       = O(n^3)


PRAM time: O(n), use n^2 processors

Row-Column Block Layout

Use p processors, each has an (n/sqrt(p))-by-(n/sqrt(p)) block

Algorithm
	for k=1 to n
	    broadcast d_{k-1}(:,k) in rows;
	    broadcast d_{k-1}(k,:) in columns;
	    compute d_k(i,j) (local)
	end.


Assuming ring-based broadcast.

In iteration k=1, it takes sqrt(p) steps to send p(0,j) share of
d_0(0,:) to d_0(s-1,:) where s = sqrt(p). The time is

	s*(alpha + (n/s)*beta),		s = sqrt(p)

In the following iterations, the communication can be pipelined,
a message arrives every time interval of alpha + (n/s)*beta,
the computation time is
2*(n^2/p). When n is large, computation

is longer than communication, so they overlap.
In the last iteration, it takes sqrt(p) steps to send
p(s-1,j) share of d_{n-1}(n-1,:) to d_{n-1}(0,:). Similar
to the first iteration, the time is
        s*(alpha + (n/s)*beta),         s = sqrt(p)


Total time
	  2*s*(alpha + (n/s)*beta) ------------------- communication
	+ sum_{k=1}^{n-1} (2*n^2/p) ------------------ computation
	= O(s)*alpha + O(n)*beta + O(n^3/p)


Efficiency
                         1
       ----------------------------------------
	1 + (p^{1.5}/n^3)*alpha + (p/n^2)*beta


Remarks


When n>>p, efficiency is near 1.0

To maintain the efficiency, n should increase at the
same rate as sqrt(p)