Communication

References

  • IPC: 2.3--2.10, 3.1--3.9
  • MPI
  • Demmel: L3, L5, L9(1)
  • Model

    The communication time for one packet is modeled as

    	t_{comm} = t_s + (m*t_w + t_h)*l
    
    where
        t_s is the startup time (adding headers, establishing the connection, etc.);
        t_w is the per-word transfer time (seconds per word; its reciprocal is the bandwidth in words per second);
        t_h is the hop time (the time for the header to travel from a processor to an adjacent processor), usually small and negligible;
        m is the message size (words);
        l is the number of links between the two communicating processors.
    Usually, for a direct connection, we use
    	t_{comm} = t_s + m*t_w
    
    The startup time t_s is usually much larger than the per-word transfer time t_w, so it is more efficient to send a few long messages than many short ones. For example, on a cluster of workstations running PVM over Ethernet, t_s is around 10^2 to 10^5 cycles and t_w is around 1e-9 to 1e-7 seconds per byte.
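
    To see why few long messages win, compare sending m words as one message,
    	t_s + m*t_w,
    with sending them one word at a time,
    	m*(t_s + t_w) = m*t_s + m*t_w,
    which pays the startup cost m times instead of once.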

    Routing mechanism

  • Store-and-forward
    	t_{comm} = t_s + (m*t_w + t_h)*l ~ t_s + m*t_w*l
    
  • Cut-through
    	t_{comm} = t_s + l*t_h + m*t_w ~ t_s + m*t_w
    
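
    For example, with l = 4 links and a long message (m*t_w much larger than t_s and t_h), store-and-forward takes roughly 4*m*t_w while cut-through takes roughly m*t_w: cut-through pipelines the message through the links instead of receiving and retransmitting the whole message at every intermediate processor.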
  • Static interconnection networks

  • square p-processor wraparound mesh. Each processor has four wires, for a total of 2*p wires, and the average number of hops is sqrt(p)/2.
  • hypercube. Each processor has log(p) wires, for a total of (p*log(p))/2 wires, and the average number of hops is log(p)/2.

    For the same cost (total number of wires), each mesh link can be log(p)/4 wires wide, i.e., each mesh link has log(p)/4 times the bandwidth of a hypercube link. To compare the communication times,

    hypercube:

    	t_{comm} = t_s + m*t_w
    
    mesh:
            t_{comm} = t_s + m*t_w / (log(p)/4)
    
    For p > 16 (so that log(p)/4 > 1), the mesh has better cost/performance than the hypercube.
  • embedding mesh, ring, tree in hypercube
  • Performance analysis (broadcast)

    Store-and-forward

    Ring: we can broadcast in both directions around the ring.

            t_{bcast} = (t_s + m*t_w)*p/2
    

    Mesh (2D): we first broadcast in a row, then in all columns.

            t_{bcast} = (t_s + m*t_w)*sqrt(p)
    

    Hypercube: we first send to the node that differs in the most significant bit (m.s.b.), then to the nodes that differ in the second m.s.b., ..., and finally to the nodes that differ in the least significant bit (l.s.b.).

            t_{bcast} = (t_s + m*t_w)*log(p)
    

    Cut-through

    Ring: first P0 sends to P(p/2), then both P0 and P(p/2) send to P(p/4) and P(3p/4), and so on.

            t_{bcast} = (t_s + m*t_w)*log(p)
    

    Mesh (2D): first broadcast in a row, then in all columns simultaneously.

            t_{bcast} = 2*(t_s + m*t_w)*log(sqrt(p))
    

    Hypercube: broadcast along a balanced binary tree.

            t_{bcast} = (t_s + m*t_w)*log(p)
    

    Example

  • Parallel prefix

    The add-scan of a sequence: x(0), x(1), ..., x(p-1) is:

        y(0) = x(0)
        y(1) = x(0) + x(1)
        ...
        y(p-1) = x(0) + x(1) + ... + x(p-1)
    
    The operation "+" can be replaced by any of "*", "max", or "min".

    A lower bound. Any function of n parameters that depends on all of them (e.g., x(0) + x(1) + ... + x(n-1)) requires at least log(n) binary operations. To see this, note that the result is produced by a final binary operation, so one step back it depends on at most 2 intermediate values. Each of those values is itself produced from at most 2 operands, so two steps back the result depends on at most 4 values. In general, a value computed in k steps can depend on at most 2^k parameters; hence 2^k >= n, i.e., k >= log(n).

    A parallel algorithm for computing y(0) to y(p-1) in 2*log_2(p) - 1 steps. Assume a binary tree network of processors (actually, a linear array is sufficient).

    Phase 1 (up-the-tree)

        if leaf node
            send(v, parent);           /* v is this leaf's value x(i) */
        else {
            recv(m, leftChild);        /* m = sum of the left subtree's leaves */
            recv(tmp, rightChild);     /* tmp = sum of the right subtree's leaves */
            if not root
                send(m+tmp, parent);   /* pass this subtree's total upward */
        }
    
    After phase 1, each interior node holds in m the sum of all the values v in the leaves of its left subtree.

    Phase 2 (down-the-tree)

        if root
            t = 0;                     /* nothing lies to the left of the root's subtree */
        else
            recv(t, parent);           /* t = sum of all leaves to the left of this subtree */
        if leaf
            y = t + v;                 /* the prefix sum y(i) for this leaf */
        else {
            send(t, leftChild);        /* the left subtree has the same left-sum */
            send(t+m, rightChild);     /* the right subtree's left-sum also includes m */
        }
    
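
    For reference, the MPI library (introduced below) provides a collective, MPI_Scan, that computes exactly this inclusive add-scan across processes. A minimal sketch, assuming each process holds one value x:

        double x = 1.0;   /* this process's value x(i) */
        double y;         /* receives x(0) + x(1) + ... + x(i) */
        MPI_Scan(&x, &y, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);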
  • Fish and shark problem
  • Dynamic interconnection networks

  • crossbar switch: flexible (any permutation, no message contention), expensive (n^2 switches).
  • bus: inexpensive, but a bottleneck (message contention); requires cache coherence.
  • multistage: omega network (message contention, less expensive than crossbar switch).
  • Message Passing

    A message-passing parallel program is usually written in a conventional sequential language such as Fortran, C, or C++, together with a communication library. We use MPI (Message Passing Interface) as the communication library. Any MPI program starts with

    	int MPI_Init(
    		int*	argc,
    		char**	argv[])
    
    and ends with
    	int MPI_Finalize(void)
    
    A short tutorial is available.
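
    For example, a minimal MPI program might look as follows (a sketch; MPI_Comm_rank and MPI_Comm_size are the standard calls for querying a process's rank and the number of processes):

        #include <stdio.h>
        #include <mpi.h>

        int main(int argc, char* argv[])
        {
            int rank, size;

            MPI_Init(&argc, &argv);                /* must precede all other MPI calls */
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this process's rank: 0, ..., size-1 */
            MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total number of processes */
            printf("process %d of %d\n", rank, size);
            MPI_Finalize();                        /* no MPI calls allowed after this */
            return 0;
        }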

    One-to-one communication

  • blocking send and receive
    	int MPI_Send(
    		void*		buf	/* send buffer */,
    		int		count	/* number of entries */,
    		MPI_Datatype	dtype	/* datatype of entries */,
    		int		dest	/* destination node rank */,
    		int		tag	/* message tag */,
    		MPI_Comm	comm	/* communicator */)
    
    Some predefined MPI datatypes:
    	MPI_CHAR		signed char
    	MPI_INT			signed int
    	MPI_UNSIGNED		unsigned int
    	MPI_FLOAT		float
    	MPI_DOUBLE		double
    	MPI_BYTE
    
    	int MPI_Recv(
    		void*		buf	/* recv buffer */,
    		int		count	/* number of entries */,
    		MPI_Datatype	dtype	/* datatype of entries */,
    		int		source	/* source node rank */,
    		int		tag	/* message tag */,
    		MPI_Comm	comm	/* communicator */,
    		MPI_Status*	status	/* return status */)
    
    Wildcards such as MPI_ANY_TAG, MPI_ANY_SOURCE can be used.
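
    For example, a sketch in which process 0 sends 100 integers to process 1 (the message tag 0 and the buffer size are arbitrary choices):

        int msg[100];
        int rank;
        MPI_Status status;

        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {
            /* ... fill msg ... */
            MPI_Send(msg, 100, MPI_INT, 1 /* dest */, 0 /* tag */, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(msg, 100, MPI_INT, 0 /* source */, 0 /* tag */, MPI_COMM_WORLD, &status);
        }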

    The above MPI_Send and MPI_Recv are blocking functions. In particular, if the system provides no buffering, the buffer passed to MPI_Send cannot be reused until the data has been copied to the destination processor, i.e., until the matching MPI_Recv is posted. This can cause deadlock. For example, suppose we want two processors P0 and P1 to pass messages circularly and both write the following code:

        MPI_Send(sendbuf, ...);
        MPI_Recv(recvbuf, ...);
    
    then both processors block in MPI_Send, each waiting for the other to post a receive, and neither MPI_Send ever returns. A solution is to order the calls differently on the two processors:
        if (myRank == 0) {
            MPI_Send(sendbuf, ...);
            MPI_Recv(recvbuf, ...);
        } else {
            MPI_Recv(recvbuf, ...);
            MPI_Send(sendbuf, ...);
        }
    
    We can also let MPI take care of the problem: the function MPI_Sendrecv performs the send and the receive together so that no deadlock occurs. MPI also provides the nonblocking communication functions MPI_Isend (immediate send) and MPI_Irecv.
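
    For example, the circular exchange can be written with one MPI_Sendrecv call per process (a sketch, assuming rank, size, count, sendbuf, and recvbuf have already been set up):

        int next = (rank + 1) % size;          /* send to the next rank ...          */
        int prev = (rank - 1 + size) % size;   /* ... receive from the previous one  */
        MPI_Status status;

        MPI_Sendrecv(sendbuf, count, MPI_DOUBLE, next, 0 /* send tag */,
                     recvbuf, count, MPI_DOUBLE, prev, 0 /* recv tag */,
                     MPI_COMM_WORLD, &status);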
  • Collective communication

  • Broadcast (one-to-all, fan-out)
    	int MPI_Bcast(
    		void*		buf	/* message buffer */,
    		int		count	/* number of entries */,
    		MPI_Datatype	dtype	/* datatype of entries */,
    		int		root	/* broadcast root */,
    		MPI_Comm	comm	/* communicator */)
    
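    For example, a sketch in which process 0 broadcasts one double to every process (the root 0 and the value are arbitrary; rank is assumed to come from MPI_Comm_rank):

        double x;
        if (rank == 0)
            x = 3.14;                          /* only the root's value matters */
        MPI_Bcast(&x, 1, MPI_DOUBLE, 0 /* root */, MPI_COMM_WORLD);
        /* now every process in MPI_COMM_WORLD has x == 3.14 */
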
  • Reduce (all-to-one, fan-in)
        int MPI_Reduce(
    		void*		operand	/* send buffer */,
    		void*		result	/* recv buffer */,
    		int		count	/* number of entries */,
    		MPI_Datatype	dtype	/* datatype of entries */,
    		MPI_Op		op	/* operation */,
    		int		root	/* reduce root (receives the result) */,
    		MPI_Comm	comm	/* communicator */)
    
    Some predefined operations:
    	MPI_MAX		maximum
    	MPI_MIN		minimum
    	MPI_SUM		sum
    
    In MPI the send buffer and recv buffer must be different.
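
    For example, a sketch of a global sum; note that localSum and globalSum are distinct buffers, as required:

        double localSum = 0.0;    /* each process's partial sum (computed elsewhere) */
        double globalSum;
        MPI_Reduce(&localSum, &globalSum, 1, MPI_DOUBLE, MPI_SUM,
                   0 /* root */, MPI_COMM_WORLD);
        /* globalSum is defined only on the root, process 0 */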
  • Scatter and Gather
        int MPI_Scatter(
    		void*		sendbuf	  /* send buffer (significant only at root) */,
    		int		sendcount /* entries sent to each process */,
    		MPI_Datatype	sendtype  /* datatype of send entries */,
    		void*		recvbuf	  /* recv buffer */,
    		int		recvcount /* entries received by each process */,
    		MPI_Datatype	recvtype  /* datatype of recv entries */,
    		int		root	  /* scatter root */,
    		MPI_Comm	comm	  /* communicator */)

        int MPI_Gather(
    		void*		sendbuf	  /* send buffer */,
    		int		sendcount /* entries sent by each process */,
    		MPI_Datatype	sendtype  /* datatype of send entries */,
    		void*		recvbuf	  /* recv buffer (significant only at root) */,
    		int		recvcount /* entries received from each process */,
    		MPI_Datatype	recvtype  /* datatype of recv entries */,
    		int		root	  /* gather root */,
    		MPI_Comm	comm	  /* communicator */)
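
    For example, a sketch in which the root deals out N entries to each process and gathers N results back from each (N, the buffer names, and the use of rank 0 as root are assumptions for illustration; rank and size come from MPI_Comm_rank and MPI_Comm_size, and malloc requires <stdlib.h>):

        #define N 4                            /* entries per process (assumed) */
        double mine[N];                        /* this process's piece */
        double* all = NULL;                    /* whole array; significant only at the root */

        if (rank == 0) {
            all = malloc(N * size * sizeof(double));
            /* ... fill all[0 .. N*size-1] ... */
        }
        MPI_Scatter(all, N, MPI_DOUBLE, mine, N, MPI_DOUBLE, 0, MPI_COMM_WORLD);
        /* ... each process works on its N entries in mine ... */
        MPI_Gather(mine, N, MPI_DOUBLE, all, N, MPI_DOUBLE, 0, MPI_COMM_WORLD);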