Communication

References

  • IPC: 2.3--2.10, 3.1--3.9
  • MPI
  • Demmel: L3, L5, L9(1)
  • Model

    The communication time for one packet is modeled as

    	t_{comm} = t_s + (m*t_w + t_h)*l
    
    where
        t_s is the startup time (adding headers, establishing the connection, etc.);
        t_w is the per-word transfer time (seconds per word; its reciprocal is the bandwidth in words per second);
        t_h is the hop time (the time for the header to travel from a processor to an adjacent processor), usually small and negligible;
        m is the message size (words);
        l is the number of links between the two communicating processors.
    Usually, for a direct connection, we use
    	t_{comm} = t_s + m*t_w
    
    The startup time t_s is usually much larger than the per-word transfer time t_w, so it is more efficient to send a few long messages than many short ones. For example, on a cluster of workstations running PVM over Ethernet, t_s is around 10^2 to 10^5 cycles and t_w is around 1e-9 to 1e-7 seconds per byte.
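
    To see why few long messages win, compare sending m words as one message,
    	t_s + m*t_w,
    with sending them one word at a time,
    	m*(t_s + t_w) = m*t_s + m*t_w,
    which pays the startup cost m times instead of once.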

    Routing mechanism

  • Store-and-forward
    	t_{comm} = t_s + (m*t_w + t_h)*l ~ t_s + m*t_w*l
    
  • Cut-through
    	t_{comm} = t_s + l*t_h + m*t_w ~ t_s + m*t_w
    
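
    For example, with l = 4 links and a long message (m*t_w much larger than t_s and t_h), store-and-forward takes roughly 4*m*t_w while cut-through takes roughly m*t_w: cut-through pipelines the message through the links instead of receiving and retransmitting the whole message at every intermediate processor.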
  • Static interconnection networks

  • square p-processor wraparound mesh. Each processor has four wires, for a total of 2*p wires, and the average number of hops is sqrt(p)/2.
  • hypercube. Each processor has log(p) wires, for a total of (p*log(p))/2 wires, and the average number of hops is log(p)/2.

    For the same cost (total number of wires), each mesh link can be log(p)/4 wires wide, i.e., each mesh link has log(p)/4 times the bandwidth of a hypercube link. To compare the communication times,

    hypercube:

    	t_{comm} = t_s + m*t_w
    
    mesh:
            t_{comm} = t_s + m*t_w / (log(p)/4)
    
    For p > 16 (so that log(p)/4 > 1), the mesh has better cost/performance than the hypercube.
  • embedding mesh, ring, tree in hypercube
  • Performance analysis (broadcast)

    Store-and-forward

    Ring: we can broadcast in both directions around the ring.

            t_{bcast} = (t_s + m*t_w)*p/2
    

    Mesh (2D): we first broadcast in a row, then in all columns.

            t_{bcast} = (t_s + m*t_w)*sqrt(p)
    

    Hypercube: we first send to the node that differs in the most significant bit (m.s.b.), then to the nodes that differ in the second m.s.b., ..., and finally to the nodes that differ in the least significant bit (l.s.b.).

            t_{bcast} = (t_s + m*t_w)*log(p)
    

    Cut-through

    Ring: first P0 sends to P(p/2), then both P0 and P(p/2) send to P(p/4) and P(3p/4), and so on.

            t_{bcast} = (t_s + m*t_w)*log(p)
    

    Mesh (2D): first broadcast in a row, then in all columns simultaneously.

            t_{bcast} = 2*(t_s + m*t_w)*log(sqrt(p))
    

    Hypercube: broadcast along a balanced binary tree.

            t_{bcast} = (t_s + m*t_w)*log(p)
    

    Example

  • Parallel prefix

    The add-scan of a sequence: x(0), x(1), ..., x(p-1) is:

        y(0) = x(0)
        y(1) = x(0) + x(1)
        ...
        y(p-1) = x(0) + x(1) + ... + x(p-1)
    
    The operation "+" can be replaced by any of "*", "max", or "min".

    A lower bound. Any function of n parameters that depends on all of them (e.g., x(0) + x(1) + ... + x(n-1)) requires at least log(n) binary operations. To see this, note that the result is produced by a final binary operation, so one step back it depends on at most 2 intermediate values. Each of those values is itself produced from at most 2 operands, so two steps back the result depends on at most 4 values. In general, a value computed in k steps can depend on at most 2^k parameters; hence 2^k >= n, i.e., k >= log(n).

    A parallel algorithm for computing y(0) to y(p-1) in 2*log_2(p) - 1 steps. Assume a binary tree network of processors (actually, a linear array is sufficient).

    Phase 1 (up-the-tree)

        if leaf node
            send(v, parent);           /* v is this leaf's value x(i) */
        else {
            recv(m, leftChild);        /* m = sum of the left subtree's leaves */
            recv(tmp, rightChild);     /* tmp = sum of the right subtree's leaves */
            if not root
                send(m+tmp, parent);   /* pass this subtree's total upward */
        }
    
    After phase 1, each interior node holds in m the sum of all the values v in the leaves of its left subtree.

    Phase 2 (down-the-tree)

        if root
            t = 0;                     /* nothing lies to the left of the root's subtree */
        else
            recv(t, parent);           /* t = sum of all leaves to the left of this subtree */
        if leaf
            y = t + v;                 /* the prefix sum y(i) for this leaf */
        else {
            send(t, leftChild);        /* the left subtree has the same left-sum */
            send(t+m, rightChild);     /* the right subtree's left-sum also includes m */
        }
    
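
    For reference, the MPI library (introduced below) provides a collective, MPI_Scan, that computes exactly this inclusive add-scan across processes. A minimal sketch, assuming each process holds one value x:

        double x = 1.0;   /* this process's value x(i) */
        double y;         /* receives x(0) + x(1) + ... + x(i) */
        MPI_Scan(&x, &y, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);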
  • Fish and shark problem
  • Dynamic interconnection networks

  • crossbar switch: flexible (any permutation, no message contention), expensive (n^2 switches).
  • bus: inexpensive, but a bottleneck (message contention); requires cache coherence.
  • multistage: omega network (message contention, less expensive than crossbar switch).
  • Message Passing

    A message-passing parallel program is usually written in a conventional sequential language such as Fortran, C, or C++, together with a communication library. We use MPI (Message Passing Interface) as the communication library. Any MPI program starts with

    	int MPI_Init(
    		int*	argc,
    		char**	argv[])
    
    and ends with
    	int MPI_Finalize(void)
    
    A short tutorial is available.
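
    For example, a minimal MPI program might look as follows (a sketch; MPI_Comm_rank and MPI_Comm_size are the standard calls for querying a process's rank and the number of processes):

        #include <stdio.h>
        #include <mpi.h>

        int main(int argc, char* argv[])
        {
            int rank, size;

            MPI_Init(&argc, &argv);                /* must precede all other MPI calls */
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this process's rank: 0, ..., size-1 */
            MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total number of processes */
            printf("process %d of %d\n", rank, size);
            MPI_Finalize();                        /* no MPI calls allowed after this */
            return 0;
        }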

    One-to-one communication

  • blocking send and receive
    	int MPI_Send(
    		void*		buf	/* send buffer */,
    		int		count	/* number of entries */,
    		MPI_Datatype	dtype	/* datatype of entries */,
    		int		dest	/* destination node rank */,
    		int		tag	/* message tag */,
    		MPI_Comm	comm	/* communicator */)
    
    Some predefined MPI datatypes:
    	MPI_CHAR		signed char
    	MPI_INT			signed int
    	MPI_UNSIGNED		unsigned int
    	MPI_FLOAT		float
    	MPI_DOUBLE		double
    	MPI_BYTE
    
    	int MPI_Recv(
    		void*		buf	/* recv buffer */,
    		int		count	/* number of entries */,
    		MPI_Datatype	dtype	/* datatype of entries */,
    		int		source	/* source node rank */,
    		int		tag	/* message tag */,
    		MPI_Comm	comm	/* communicator */,
    		MPI_Status*	status	/* return status */)
    
    Wildcards such as MPI_ANY_TAG, MPI_ANY_SOURCE can be used.
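
    For example, a sketch in which process 0 sends 100 integers to process 1 (the message tag 0 and the buffer size are arbitrary choices):

        int msg[100];
        int rank;
        MPI_Status status;

        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {
            /* ... fill msg ... */
            MPI_Send(msg, 100, MPI_INT, 1 /* dest */, 0 /* tag */, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(msg, 100, MPI_INT, 0 /* source */, 0 /* tag */, MPI_COMM_WORLD, &status);
        }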

    The above MPI_Send and MPI_Recv are blocking functions. In particular, if the system provides no buffering, the buffer passed to MPI_Send cannot be reused until the data has been copied to the destination processor, i.e., until the matching MPI_Recv is posted. This can cause deadlock. For example, suppose we want two processors P0 and P1 to pass messages circularly and both write the following code:

        MPI_Send(sendbuf, ...);
        MPI_Recv(recvbuf, ...);
    
    then both processors block in MPI_Send, each waiting for the other to post a receive, and neither MPI_Send ever returns. A solution is to order the calls differently on the two processors:
        if (myRank == 0) {
            MPI_Send(sendbuf, ...);
            MPI_Recv(recvbuf, ...);
        } else {
            MPI_Recv(recvbuf, ...);
            MPI_Send(sendbuf, ...);
        }
    
    We can also let MPI take care of the problem: the function MPI_Sendrecv performs the send and the receive together so that no deadlock occurs. MPI also provides the nonblocking communication functions MPI_Isend (immediate send) and MPI_Irecv.
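
    For example, the circular exchange can be written with one MPI_Sendrecv call per process (a sketch, assuming rank, size, count, sendbuf, and recvbuf have already been set up):

        int next = (rank + 1) % size;          /* send to the next rank ...          */
        int prev = (rank - 1 + size) % size;   /* ... receive from the previous one  */
        MPI_Status status;

        MPI_Sendrecv(sendbuf, count, MPI_DOUBLE, next, 0 /* send tag */,
                     recvbuf, count, MPI_DOUBLE, prev, 0 /* recv tag */,
                     MPI_COMM_WORLD, &status);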
  • Collective communication

  • Broadcast (one-to-all, fan-out)
    	int MPI_Bcast(
    		void*		buf	/* message buffer */,
    		int		count	/* number of entries */,
    		MPI_Datatype	dtype	/* datatype of entries */,
    		int		root	/* broadcast root */,
    		MPI_Comm	comm	/* communicator */)
    
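    For example, a sketch in which process 0 broadcasts one double to every process (the root 0 and the value are arbitrary; rank is assumed to come from MPI_Comm_rank):

        double x;
        if (rank == 0)
            x = 3.14;                          /* only the root's value matters */
        MPI_Bcast(&x, 1, MPI_DOUBLE, 0 /* root */, MPI_COMM_WORLD);
        /* now every process in MPI_COMM_WORLD has x == 3.14 */
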
  • Reduce (all-to-one, fan-in)
        int MPI_Reduce(
    		void*		operand	/* send buffer */,
    		void*		result	/* recv buffer */,
    		int		count	/* number of entries */,
    		MPI_Datatype	dtype	/* datatype of entries */,
    		MPI_Op		op	/* operation */,
    		int		root	/* reduce root (receives the result) */,
    		MPI_Comm	comm	/* communicator */)
    
    Some predefined operations:
    	MPI_MAX		maximum
    	MPI_MIN		minimum
    	MPI_SUM		sum
    
    In MPI the send buffer and recv buffer must be different.
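
    For example, a sketch of a global sum; note that localSum and globalSum are distinct buffers, as required:

        double localSum = 0.0;    /* each process's partial sum (computed elsewhere) */
        double globalSum;
        MPI_Reduce(&localSum, &globalSum, 1, MPI_DOUBLE, MPI_SUM,
                   0 /* root */, MPI_COMM_WORLD);
        /* globalSum is defined only on the root, process 0 */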
  • Scatter and Gather
        int MPI_Scatter(
    		void*		sendbuf	  /* send buffer (significant only at root) */,
    		int		sendcount /* entries sent to each process */,
    		MPI_Datatype	sendtype  /* datatype of send entries */,
    		void*		recvbuf	  /* recv buffer */,
    		int		recvcount /* entries received by each process */,
    		MPI_Datatype	recvtype  /* datatype of recv entries */,
    		int		root	  /* scatter root */,
    		MPI_Comm	comm	  /* communicator */)

        int MPI_Gather(
    		void*		sendbuf	  /* send buffer */,
    		int		sendcount /* entries sent by each process */,
    		MPI_Datatype	sendtype  /* datatype of send entries */,
    		void*		recvbuf	  /* recv buffer (significant only at root) */,
    		int		recvcount /* entries received from each process */,
    		MPI_Datatype	recvtype  /* datatype of recv entries */,
    		int		root	  /* gather root */,
    		MPI_Comm	comm	  /* communicator */)
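
    For example, a sketch in which the root deals out N entries to each process and gathers N results back from each (N, the buffer names, and the use of rank 0 as root are assumptions for illustration; rank and size come from MPI_Comm_rank and MPI_Comm_size, and malloc requires <stdlib.h>):

        #define N 4                            /* entries per process (assumed) */
        double mine[N];                        /* this process's piece */
        double* all = NULL;                    /* whole array; significant only at the root */

        if (rank == 0) {
            all = malloc(N * size * sizeof(double));
            /* ... fill all[0 .. N*size-1] ... */
        }
        MPI_Scatter(all, N, MPI_DOUBLE, mine, N, MPI_DOUBLE, 0, MPI_COMM_WORLD);
        /* ... each process works on its N entries in mine ... */
        MPI_Gather(mine, N, MPI_DOUBLE, all, N, MPI_DOUBLE, 0, MPI_COMM_WORLD);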