We've all coded matrix multiply at least once in our lives (and soon at least twice). Here is the simplest algorithm to compute C = C + A*B, where all matrices are n-by-n:
for i = 1 to n
  for j = 1 to n
    for k = 1 to n
      C(i,j) = C(i,j) + A(i,k)*B(k,j)
    end for
  end for
end for

Unfortunately, this algorithm is oblivious to the computer's memory hierarchy, and optimizing compilers provide limited solace. Consider the SUN compiler optimizations, which you can read about (click here). At the bottom (section D), there is a discussion of performance tuning and a reference to a libfast.a library, which can be linked in with
cc file.c -fast -lfast -xlic_lib=sunperf

SUN also ships a performance library with the compiler, based on the BLAS libraries. It includes a BLAS3 function, dgemm, written in Fortran, for matrix-matrix multiplication.
If you don't want to pay for the SUN performance library, you might look around the Netlib repository of public-domain software on the WWW to see whether there is a public-domain matrix multiply optimized for SUN architectures. Alternatively, you can call BLAS routines from your C program and consider locality yourself.
Here is some basic Sparc5 and Ultra5 architectural information:
System           CPU                Clk    Cache        SPEC    SPEC
Name             Type               MHz    Ext+I/D      95int   95fp
---------------  -----------------  -----  -----------  ------  ------
SPARCstation5    MicroSPARC II V8   170    16/16        3.32    2.91
Ultra5           UltraSPARC IIi V9  270    256E+16/16   9.5     10.1

More architectural information can be found from SPARC5 and Ultra5.
For this assignment, we will be coding an optimized, memory-hierarchy-cognizant matrix-matrix multiply routine in C. For simplicity, we will only require square matrices (worrying about the non-square case, essential for a good library code, can be a bit time-consuming). The goal of the assignment is to write the fastest correct routine you can.
We will work in teams of two people. Each team will create a single routine and submit it to the professor. On the Friday following the due date, we will gather during class, and each team will give a short two-minute presentation about its routine and why it should run fast. We will then time the routines on a dedicated Ultra5, and also make sure you get the right answer.
The following is the template for the test program and method for measuring performance.
#include <stdio.h>
#include <time.h>

void loop(int i)
{
    int j;
    clock_t start;
    double clicks;

    /* get start time */
    start = clock();

    /* body of the routine; optimization will kill this, so no -O please */
    for (j = 0; j < i; j++) {
    }

    /* get the time used */
    clicks = clock() - start;

    /* convert the time to seconds and
       print out the actual number of seconds used */
    printf("%.3f seconds were used\n", clicks / CLOCKS_PER_SEC);
}

You are welcome to discuss your optimization strategies with other teams. You should, however, keep in mind that this is a competition -- any hints might distort the contest results.
Here is the routine interface we will use.
/*
** mul_mfmf_mf:
**   Optimized matrix multiply routine.
**
** Parameters:
**   int matdim:  the matrix dimension
**   double *A:   source1 matrix of size (matdim)x(matdim)
**   double *B:   source2 matrix of size (matdim)x(matdim)
**   double *C:   destination matrix of size (matdim)x(matdim)
**
** Operation:
**   C = C + A*B;
*/
void mul_mfmf_mf(int matdim, const double *A, const double *B, double *C);

Assume an ANSI-compliant C compiler and that the arrays are stored in row-major order. This means that the following loop will access memory locations in increasing consecutive order:
for (i = 0; i < matdim; i++)
    for (j = 0; j < matdim; j++)
        access(A[i*matdim + j]);
For reference, the BLAS directory on Netlib is located at http://www.netlib.org/blas/index.html. A paper describing the interface in that directory is located at http://www.netlib.org/blas/blasqr.ps, and a BLAS matrix multiply in Fortran can be found at http://www.netlib.org/blas/dgemm.f.
A good set of papers on the general issues of blocking algorithms for memory hierarchies is available here; see the README file for a list of the papers' titles.
For handin: a report (maximum four pages) explaining what you did and why your routine runs fast.
Your marks depend on how close you can get to the performance of the SUN tuned library function.