
Benchmarking Using a 2D Finite Difference Code

  1. Benchmarking
  2. Results
Presented here is a 2D finite difference code that is useful for testing and benchmarking clusters, for the following reasons:
  1. By controlling the "height" and "width" of the grid, one may move between a communication-bound execution and a compute-bound execution.
  2. One may compile this code once for a platform and control run parameters easily using command line arguments; it is ideal for inclusion in scripts.
  3. It should be simple to instrument using various profiling tools.

Overview

This code is a parallel (MPI) finite difference implementation of 2D heat conduction over a rectangular domain. It supports the following methods:
  • Jacobi
  • Gauss-Seidel
  • SOR
The communication scheme uses shared, or "ghost", rows that are exchanged between adjacent processes. Assuming a fixed width, this scheme shows linear speedup while each process holds many rows (#rows/#procs much greater than 1), but performance degrades steadily as the distribution approaches 1 row per process. This is due to communication overhead: the communication scheme is not optimized for any of the methods used here. The scheme does, however, allow one to control whether a run is bound by computation or by communication. Once one has reached a single row per process, there is some width (-w) at which the computational work reduces the effects of communication overhead. In other words, more time will be spent computing than communicating.
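The row distribution described above can be sketched as follows. This is an illustration under assumptions, not the decomposition used in 2dheat.c: the names row_block and split_rows are hypothetical, leftover rows are handed to the lowest ranks, and each process is assumed to keep one ghost row per neighbor.

```c
#include <stddef.h>

/* Hypothetical description of the rows owned by one process. */
struct row_block {
    size_t first;   /* index of the first owned row */
    size_t count;   /* number of owned rows */
    size_t ghosts;  /* ghost rows held for neighbors (0, 1, or 2) */
};

/* Split `height` rows over `nprocs` processes as evenly as possible.
 * Assumption: the first (height % nprocs) ranks get one extra row. */
struct row_block split_rows(size_t height, size_t nprocs, size_t rank)
{
    struct row_block b;
    size_t base = height / nprocs;
    size_t extra = height % nprocs;

    b.count = base + (rank < extra ? 1 : 0);
    b.first = rank * base + (rank < extra ? rank : extra);
    /* One ghost row per existing neighbor: none above rank 0,
     * none below the last rank. */
    b.ghosts = (rank > 0 ? 1 : 0) + (rank + 1 < nprocs ? 1 : 0);
    return b;
}
```

The number of ghost rows per process does not shrink as more processes are added, while the number of owned rows does. That is why, at fixed width, the exchange takes up a growing share of each iteration as the distribution approaches one row per process.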

It should be noted that this is one of many types of communication schemes. This implementation is not meant to be better than any particular one for testing a cluster environment, but it is useful nonetheless.

Platforms

The code has been run on various platforms. As far as I know, it is standard C with minimal standard library dependencies. In other words, if a C compiler and MPI libraries are available, it should compile and run.

Getting and Compiling 2dheat

  % svn co https://svn.loni.org/repos/2dheat/trunk 2dheat # LONI login required
If that doesn't work, download it here, though this most likely will not be current.

Compiling is simple. It requires an MPI library and a C compiler. The example below uses the mpicc compiler wrapper common to most Linux clusters (MPICH, etc.); the math library is listed last so that the linker sees it after the source file that needs it:
  % mpicc 2dheat.c -o 2dheat.x -lm
The file "./compile.sh" is set up to compile using "mpicc" or "mpcc_r" (AIX), whichever is found first.

Usage

These options have been designed with the thought of incorporating the executable into a script that iterates over various grid dimensions and number of processors to get an accurate picture of what is happening on the cluster being tested.
-h [1-9][\0-9]* : height of grid (in number of nodes); default is 50
-w [1-9][\0-9]* : width of grid (in number of nodes); default is 50
-m [123]        : 1 = Jacobi, 2 = Gauss-Seidel, 3 = SOR
-e [\.1-9][\0-9]* : convergence criterion; default is 0.1
-t              : output time to convergence only (in seconds); overridden by -v
-v              : full verbose output - task assignment, L2 norm for each iteration, etc
Example:
  % mpirun -np 64 ./2dheat.x -h 1000 -w 1000 -t
This runs 2dheat on 64 processes over a 1000x1000 node grid and reports only the time to convergence, in seconds.

Implementation Details and Publications

Benchmarking

This code has been adapted for use as a benchmarking tool. Command line arguments allow one to control the size of the domain (width and height), the solver used, the convergence criterion, and the level of verbosity.

There are several tunable parameters that control certain aspects of the execution. This is important because, without the ability to change the width of the grid, this code becomes communication-bound as the number of rows per process goes to 1. At that point the amount of communication dominates the time to solution. By controlling the width of the grid, one may increase the number of computations per row. The following highlights some of the parameters.

To control:
  • Rows per CPU: adjust the HEIGHT with the "-h" flag
  • Amount of computation per CPU: adjust the WIDTH with the "-w" flag
  • Time to convergence: adjust the method used (-m) or the convergence criterion (-e); this essentially allows one to control run time.
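To see why width matters, a rough per-iteration cost model can help. The sketch below is not derived from 2dheat.c; the struct, the function names, and the three machine parameters (time per node update, message latency, and time per transferred value) are assumptions chosen for illustration. It models an interior process that exchanges two ghost rows per iteration.

```c
/* Hypothetical machine parameters, all in seconds. */
struct machine {
    double t_node;    /* time to update one grid node */
    double t_latency; /* fixed cost of one message */
    double t_value;   /* transfer time per grid value */
};

/* Estimated compute time per iteration for one process. */
double compute_time(const struct machine *m, double rows_per_proc, double width)
{
    return rows_per_proc * width * m->t_node;
}

/* Estimated communication time per iteration for an interior process:
 * two ghost-row messages of `width` values each. */
double comm_time(const struct machine *m, double width)
{
    return 2.0 * (m->t_latency + width * m->t_value);
}

/* A ratio above 1 means more time computing than communicating. */
double compute_to_comm(const struct machine *m, double rows_per_proc, double width)
{
    return compute_time(m, rows_per_proc, width) / comm_time(m, width);
}
```

In this model the fixed message latency is amortized as the width grows, so the ratio rises with -w and levels off at rows_per_proc * t_node / (2 * t_value). Increasing the height raises the ratio in direct proportion. Real clusters will deviate from such a simple model, which is part of what the benchmark is meant to expose.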
The following script iterates over width, height, and number of processors:
#!/bin/sh

METHOD=3
EPSILON=3.0

HEIGHT="164 256 512 1024" # 1024 2048 4096"
WIDTH="1 2 4 8 16 32 64 128 256 512" # 1024 2048 4096"
PROCS="1 2 4 8 16 32 64 96" # 64 128 256 512 1024"

echo "$METHOD $EPSILON"
echo "Widths $WIDTH"
echo "Heights $HEIGHT"
echo "Procs $PROCS"
echo
echo "    W     H   P   H/P        W/H       Time(s)   Spd Up    %Eff"
for w in ${WIDTH}; do
  for h in ${HEIGHT}; do
    serial=0
    for p in ${PROCS}; do
      for iteration in 1 2 3 4 5; do
        out=`mpirun -np ${p} ../bin/2dheat.x -t -w ${w} -h ${h} -m ${METHOD} -e ${EPSILON}`
        # capture serial time
        if [ 1 -eq ${p} ]; then
          serial=$out
        fi
        # calculate speed up
        s=`perl -e "print ($serial / $out)"`
        # calculate efficiency
        e=`perl -e "print ($s / $p)"`
        # rows per cpu
        r=`perl -e "print ($h / $p)"`
        # width per process (w / p); printed under the "W/H" header
        c=`perl -e "print ($w / $p)"`
        printf "%5d %5d %3d %10.5f %10.5f %15.9f %15.9f %15.9f\n" $w $h $p $r $c $out $s $e
      done
    done
  done
done
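The script prints one line per run in a format that is easy to read and analyze. Note that the column labeled W/H holds the width divided by the number of processes, as computed in the script, and that %Eff is reported as a fraction rather than a percentage: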
Widths 1 2 4 8 16 32 64 128 256 512
Heights 164 256 512 1024
Procs 1 2 4 8 16 32 64 96

    W     H   P   H/P        W/H       Time(s)   Spd Up    %Eff
    1   164   1  164.00000    1.00000     0.000067000     1.000000000     1.000000000
    1   164   1  164.00000    1.00000     0.000084000     1.000000000     1.000000000
    1   164   1  164.00000    1.00000     0.000076000     1.000000000     1.000000000
    1   164   1  164.00000    1.00000     0.000067000     1.000000000     1.000000000
    1   164   1  164.00000    1.00000     0.000067000     1.000000000     1.000000000
    1   164   2   82.00000    0.50000     0.000057000     1.175438596     0.587719298
    1   164   2   82.00000    0.50000     0.000055000     1.218181818     0.609090909
    1   164   2   82.00000    0.50000     0.000054000     1.240740741     0.620370370
    1   164   2   82.00000    0.50000     0.000052000     1.288461538     0.644230769
    1   164   2   82.00000    0.50000     0.000052000     1.288461538     0.644230769
    1   164   4   41.00000    0.25000     0.000077000     0.870129870     0.217532468
    1   164   4   41.00000    0.25000     0.000082000     0.817073171     0.204268293
    1   164   4   41.00000    0.25000     0.000072000     0.930555556     0.232638889
    1   164   4   41.00000    0.25000     0.000079000     0.848101266     0.212025316
    1   164   4   41.00000    0.25000     0.000079000     0.848101266     0.212025316
    1   164   8   20.50000    0.12500     0.000240000     0.279166667     0.034895833
    1   164   8   20.50000    0.12500     0.000225000     0.297777778     0.037222222
    1   164   8   20.50000    0.12500     0.000234000     0.286324786     0.035790598
    1   164   8   20.50000    0.12500     0.000219000     0.305936073     0.038242009
    1   164   8   20.50000    0.12500     0.000247000     0.271255061     0.033906883
    1   164  16   10.25000    0.06250     0.000321000     0.208722741     0.013045171
    1   164  16   10.25000    0.06250     0.000335000     0.200000000     0.012500000
    1   164  16   10.25000    0.06250     0.000314000     0.213375796     0.013335987
    1   164  16   10.25000    0.06250     0.000321000     0.208722741     0.013045171
    1   164  16   10.25000    0.06250     0.000331000     0.202416918     0.012651057
    1   164  32    5.12500    0.03125     0.001415000     0.047349823     0.001479682
    1   164  32    5.12500    0.03125     0.001426000     0.046984572     0.001468268
    1   164  32    5.12500    0.03125     0.001405000     0.047686833     0.001490214
    1   164  32    5.12500    0.03125     0.001428000     0.046918768     0.001466211
    1   164  32    5.12500    0.03125     0.001413000     0.047416844     0.001481776
    1   164  64    2.56250    0.01562     0.001055000     0.063507109     0.000992299
  ...
A script is provided in scripts/bench.sh. To use it, adjust the number of procs, the heights, and the widths. When running, one may use tee to view the output and record it simultaneously.
% ./bench.sh | tee bench.out
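The Spd Up and %Eff columns in the output above follow the usual definitions: speedup is the serial time divided by the parallel time, and efficiency is speedup divided by the number of processes. A small sketch of that arithmetic is given here, together with a mean over repeated runs, which the script itself does not compute. The function names are hypothetical.

```c
#include <stddef.h>

/* Speedup of a parallel run relative to the serial time. */
double speedup(double t_serial, double t_parallel)
{
    return t_serial / t_parallel;
}

/* Parallel efficiency as a fraction (1.0 means perfect scaling). */
double efficiency(double t_serial, double t_parallel, int procs)
{
    return speedup(t_serial, t_parallel) / procs;
}

/* Arithmetic mean of n timings; returns 0.0 when n is 0. */
double mean_time(const double *t, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += t[i];
    return n ? sum / (double)n : 0.0;
}
```

For the first two-process line of the sample output, speedup(0.000067, 0.000057) is about 1.1754 and the efficiency is about 0.5877, which matches the table. Averaging the five repetitions before computing speedup would smooth out run-to-run noise, which is visible even in the serial timings.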

Platform Results

The following is a list of platforms and their results. The script used to produce each set of results is included. For results submitted by others to be included, one must provide both the result file and the script used to run the benchmark.
  • eric.loni.org [results] [script]
  • oliver.loni.org [results] [script]

Future

Regarding this code, what combinations of number of processors, heights, and widths will yield some good measure or view of a cluster's capabilities remains an open question. Furthermore, how can this information be visualized to provide a picture of what is going on? One thought is to whip up a simple gnuplot script (or set of them) to generate a series of plots.

Also, there is nothing saying that the methods employed in this code have to solve a 2D heat conduction equation - i.e., arbitrary methods may be employed to create more computationally intense kernels.

It would also be helpful to document the performance characteristics of the code itself, as a basis for comparing various clusters and architectures.


Copyright © 2004-2009. All Rights Reserved. The statements and opinions included in these pages are those of me only. Any statements and opinions included in these pages are not those of Louisiana State University or the LSU Board of Supervisors.