Assignment 1: Due February 15, Thursday, end of class.

For this first assignment, we will have a little programming contest. Although the assignment isn't about "parallelism" per se, it involves some of the most crucial issues in high-performance algorithm design, namely dealing with the memory hierarchy.

We've all coded matrix multiply at least once in our lives (and soon at least twice). Here is the simplest algorithm to compute C = C + A*B, where all matrices are n-by-n:

      for i = 1 to n
          for j = 1 to n
              for k = 1 to n
                  C(i,j) = C(i,j) + A(i,k)*B(k,j)
              end for
          end for
      end for
Unfortunately, this algorithm is oblivious to the computer's memory hierarchy, and optimizing compilers provide limited solace. Consider the SUN compiler optimizations, which you can read about in SUN's documentation. At the bottom (section D), there is a discussion of performance tuning and a reference to a libfast.a library, which can be linked in using
        cc file.c -fast -lfast -xlic_lib=sunperf
Also, SUN has a performance library, included with the compiler, that is based on the BLAS libraries. There is a BLAS3 function, dgemm, written in Fortran, for matrix-matrix multiplication.

If you don't want to pay for the SUN performance library, you might look around the Netlib Repository of public-domain software on the WWW to see whether there is a public-domain matrix multiply optimized for SUN architectures. Alternatively, you can call BLAS routines from your C program and consider locality yourself.

Here is some basic Sparc5 and Ultra5 architectural information:

System           CPU              ClkMHz   Cache     SPEC  SPEC
Name             Type             ext/in   Ext+I/D   95int 95fp
--------------- ----------------- ------ ----------- ----- -----
SPARCstation5   MicroSPARC II V8  170    16/16        3.32  2.91
Ultra5          UltraSPARCIIi V9  270    256E+16/16   9.5  10.1
More architectural information can be found on the SPARC5 and Ultra5 pages.

For this assignment, we will be coding an optimized, memory-hierarchy-cognizant matrix-matrix multiply routine in C. For simplicity, we will only require square matrices (worrying about the non-square case, essential for a good library code, can be time-consuming). The goals of the assignment are to:

  • write a matrix routine C = C + A*B for square matrices,
  • get as close to peak performance as possible, while still getting the right answer, and
  • get to know each other.

We will work in teams of two people. Each team will create a single routine and submit it to the Professor. We will all gather on the Friday following the due date, with each team giving a short two-minute presentation on their routine and why it should run fast. Then we will time the routines on a dedicated Ultra5, as well as check that you get the right answer.

    The following is the template for the test program and method for measuring performance.

    #include <stdio.h>
    #include <time.h>

    void loop(int i)
    {
        int j;

        clock_t start;

        double clicks;
    /*
    **  Get start time
    */
        start = clock();
    /*
    **  Body of the routine. Optimization will remove this empty loop,
    **  so no -O please.
    */
        for ( j = 0 ; j < i ; j++ ) {
        }
    /*
    **  Get the time used
    */
        clicks = clock() - start;

    /*  Convert the time to seconds and
    **  print out the actual number of seconds used.
    */
        printf("%.3f seconds were used\n", clicks / CLOCKS_PER_SEC);
    }
    
    You are welcome to discuss your optimization strategies with other teams. You should, however, keep in mind that this is a competition -- any hints might distort the contest results.

    Here is the routine interface we will use.

    /*
    **  mul_mfmf_mf:
    **    Optimized matrix multiply routine.
    **
    **    Parameters:
    **      int matdim: the matrix dimension
    **      double * A: source1 matrix of size (matdim)x(matdim)
    **      double * B: source2 matrix of size (matdim)x(matdim)
    **      double * C: destination matrix of size (matdim)x(matdim)
    **
    **    Operation:
    **      C = C + A*B;
    **
    */
    void
    mul_mfmf_mf(
            int matdim,
            const double *A,
            const double *B,
            double *C);
    
    Assume an ANSI-compliant C compiler and that the arrays are stored in row-major order. This means that the following loop accesses memory locations in increasing consecutive order:
      for (i=0;i < matdim;i++)
        for (j=0;j < matdim;j++)
           access(A[i*matdim+j]);
    

    For reference, the BLAS directory on Netlib is located at http://www.netlib.org/blas/index.html. A paper describing the interface in that directory is located at http://www.netlib.org/blas/blasqr.ps, and a BLAS matrix multiply in Fortran can be found at http://www.netlib.org/blas/dgemm.f.

    A good set of papers on the general issues of blocking algorithms for memory hierarchies is available here; look at the README file for a list of the paper titles.

    For handin: a report (maximum four pages) explaining

  • Design: Why should your program run fast?
  • Testing: How can you be sure the answer is right?
  • Performance: Compare your program with the straightforward algorithm and the SUN performance library. To make our measurements independent of problem size, we use MFLOPS (millions of floating-point operations per second). If the matrix size is n-by-n, the floating-point operation count is 2*n^3; if the multiply takes s seconds, then MFLOPS = (2*n*n*n)/(1000000*s).
  • Your marks depend on how close you can get to the SUN tuned library function.