
Performance issue on Slingshot network

Open rgayatri23 opened this issue 11 months ago • 3 comments

Hi,

I have been running mpich/4.3.0 on Perlmutter which has the slingshot network (SS11).

I ran the OSU alltoall benchmark, comparing the average latency of mpich/4.3.0 against the machine's default MPI library (cray-mpich), and I see a significant performance slowdown with mpich.

For example, the average latencies (in microseconds) are shown below. This is a 16-node run with 4 tasks per node, where each task uses a GPU buffer.

| Size (bytes) | Cray-mpich | mpich |
|---:|---:|---:|
| 1 | 146.84 | 1392.51 |
| 2 | 145.79 | 1838.57 |
| 4 | 145.19 | 1944.91 |
| 8 | 148.55 | 1933.72 |
| 16 | 148.21 | 1961.74 |
| 32 | 144.99 | 1999.52 |
| 64 | 151.31 | 1993.13 |
| 128 | 159.65 | 1996.37 |
| 256 | 163.22 | 2047.14 |
| 512 | 175.57 | 1381.85 |
| 1024 | 129.66 | 1384.73 |
| 2048 | 129.04 | 1402.66 |
| 4096 | 132.9 | 1447.07 |
| 8192 | 287.9 | 1497.81 |
| 16384 | 297.13 | 1660.14 |
| 32768 | 226.47 | 1814.38 |
| 65536 | 395.63 | 2787.03 |
| 131072 | 753.19 | 3692.19 |
| 262144 | 1577.04 | 5530.24 |
| 524288 | 3615.98 | 8515.6 |
| 1048576 | 6751.7 | 14981.21 |
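For reference, a run matching this setup would typically be launched along the following lines. The exact `srun` flags and the `osu_alltoall -d cuda` device-buffer option are assumptions about the local scheduler and benchmark build, not details taken from this report:

```shell
# Hypothetical launch matching the reported configuration:
# 16 nodes, 4 tasks per node, each task using a CUDA device buffer.
# Adjust the flags for the local scheduler and OSU build.
NODES=16
TASKS_PER_NODE=4
LAUNCH="srun -N $NODES --ntasks-per-node=$TASKS_PER_NODE ./osu_alltoall -d cuda"
echo "$LAUNCH"   # run on a system with Slurm and a CUDA-enabled OSU build
```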

I would like to know if there are any configure options I am missing, or any compile/runtime flags that could be tuned, to improve the performance.

Here are the configure options I used when building mpich:

```shell
./configure --prefix=$install_path \
    --enable-fast=O2 \
    --with-pm=no \
    --with-xpmem=/$path_to_xpmem \
    --with-wrapper-dl-type=rpath \
    --enable-threads=multiple \
    --enable-shared=yes \
    --enable-static=no \
    --with-namepublisher=file \
    --with-libfabric=$path_to_libfabric \
    --with-device=ch4:ofi \
    --with-ch4-shmmods=posix,xpmem \
    --enable-thread-cs=per-vci \
    --with-cuda=$path_to_cuda \
    CPPFLAGS=-I$path_to_pmi \
    CC=gcc CFLAGS= \
    CXX=g++ \
    FC=gfortran FCFLAGS=-fallow-argument-mismatch \
    F77=gfortran FFLAGS=-fallow-argument-mismatch \
    MPICHLIB_CFLAGS=-fPIC MPICHLIB_CXXFLAGS=-fPIC \
    MPICHLIB_FFLAGS=-fPIC MPICHLIB_FCFLAGS=-fPIC
```

I am using cuda/12.4 and did not add any specific compile or runtime options when running the above tests.

rgayatri23 avatar Apr 09 '25 18:04 rgayatri23

Try setting `MPIR_CVAR_CH4_OFI_ENABLE_HMEM=1`.
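A minimal way to apply this, assuming a Slurm-style launch (the `srun` line is illustrative only):

```shell
# MPICH reads MPIR_CVAR_* settings from the environment at MPI_Init,
# so exporting the variable before the launch is sufficient.
export MPIR_CVAR_CH4_OFI_ENABLE_HMEM=1
echo "MPIR_CVAR_CH4_OFI_ENABLE_HMEM=$MPIR_CVAR_CH4_OFI_ENABLE_HMEM"
# e.g. (illustrative): srun -N 16 --ntasks-per-node=4 ./osu_alltoall -d cuda
```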

hzhou avatar Apr 09 '25 19:04 hzhou

Thanks. That helped bridge the gap between cray-mpich and mpich from size 512 bytes and above. Is there a protocol switch that happens at that size?

rgayatri23 avatar Apr 09 '25 22:04 rgayatri23

> Thanks. That helped bridge the gap between cray-mpich and mpich from size 512 bytes and above. Is there a protocol switch that happens at that size?

We don't, but probably we should. ~Is it ROCm on Perlmutter?~ Never mind, you said CUDA.

hzhou avatar Apr 09 '25 22:04 hzhou