Performance issue on Slingshot network
Hi,
I have been running mpich/4.3.0 on Perlmutter, which uses the Slingshot (SS11) network.
I ran the OSU alltoall benchmark, comparing the average latency observed with mpich/4.3.0 against the system-default MPI library (cray-mpich), and I see a significant performance slowdown with mpich.
For example, the average latencies are shown below. The run used 16 nodes with 4 tasks per node, where each task has a GPU buffer.
| Size (bytes) | cray-mpich (µs) | mpich (µs) |
|---|---|---|
| 1 | 146.84 | 1392.51 |
| 2 | 145.79 | 1838.57 |
| 4 | 145.19 | 1944.91 |
| 8 | 148.55 | 1933.72 |
| 16 | 148.21 | 1961.74 |
| 32 | 144.99 | 1999.52 |
| 64 | 151.31 | 1993.13 |
| 128 | 159.65 | 1996.37 |
| 256 | 163.22 | 2047.14 |
| 512 | 175.57 | 1381.85 |
| 1024 | 129.66 | 1384.73 |
| 2048 | 129.04 | 1402.66 |
| 4096 | 132.9 | 1447.07 |
| 8192 | 287.9 | 1497.81 |
| 16384 | 297.13 | 1660.14 |
| 32768 | 226.47 | 1814.38 |
| 65536 | 395.63 | 2787.03 |
| 131072 | 753.19 | 3692.19 |
| 262144 | 1577.04 | 5530.24 |
| 524288 | 3615.98 | 8515.6 |
| 1048576 | 6751.7 | 14981.21 |
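For reference, the numbers above came from a run along these lines (the exact `osu_alltoall` path and `srun` flags are my assumptions based on the description, not copied from the actual job script):

```shell
# 16 nodes, 4 tasks per node, CUDA device buffers; osu_alltoall reports
# average latency in microseconds for each message size.
srun -N 16 --ntasks-per-node=4 ./osu_alltoall -d cuda
```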
I would like to know if there are any configure options that I am missing that can be used to improve the performance or any compile/runtime flags that can be optimized.
Here are the configure options I used when building mpich:

```shell
./configure --prefix=$install_path \
    --enable-fast=O2 \
    --with-pm=no \
    --with-xpmem=/$path_to_xpmem \
    --with-wrapper-dl-type=rpath \
    --enable-threads=multiple \
    --enable-shared=yes \
    --enable-static=no \
    --with-namepublisher=file \
    --with-libfabric=$path_to_libfabric \
    --with-device=ch4:ofi \
    --with-ch4-shmmods=posix,xpmem \
    --enable-thread-cs=per-vci \
    --with-cuda=$path_to_cuda \
    CPPFLAGS=-I$path_to_pmi \
    CC=gcc CFLAGS= CXX=g++ \
    FC=gfortran FCFLAGS=-fallow-argument-mismatch \
    F77=gfortran FFLAGS=-fallow-argument-mismatch \
    MPICHLIB_CFLAGS=-fPIC MPICHLIB_CXXFLAGS=-fPIC \
    MPICHLIB_FFLAGS=-fPIC MPICHLIB_FCFLAGS=-fPIC
```
I am using cuda/12.4, and I did not add any specific compile or runtime options when running the above tests.
Try setting `MPIR_CVAR_CH4_OFI_ENABLE_HMEM=1`.
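A minimal way to apply this, assuming an `srun` launcher (MPICH reads `MPIR_CVAR_*` settings from the environment, so no rebuild is needed; the node counts and benchmark flags below are taken from the run described above):

```shell
# Have the CH4/OFI netmod register GPU (HMEM) buffers with libfabric,
# so device memory can be sent over the network without extra staging.
export MPIR_CVAR_CH4_OFI_ENABLE_HMEM=1
srun -N 16 --ntasks-per-node=4 ./osu_alltoall -d cuda
```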
Thanks. That helped bridge the gap between cray-mpich and mpich from 512 bytes and above. Is there a protocol switch that happens at that size?
We don't, but probably we should. ~Is it ROCm on Perlmutter?~ Never mind. You said CUDA.