Performance issue on Slingshot network
Hi,
I have been running mpich/4.3.0 on Perlmutter, which uses the Slingshot (SS11) network.
I ran the OSU alltoall benchmark, comparing the average latency observed with mpich/4.3.0 against the system-default MPI library (cray-mpich), and I see a significant performance slowdown with mpich.
For example, the average latencies are shown below. The run used 16 nodes with 4 tasks per node, where each task has a GPU buffer.
| Size (bytes) | cray-mpich (µs) | mpich (µs) |
|---|---|---|
| 1 | 146.84 | 1392.51 |
| 2 | 145.79 | 1838.57 |
| 4 | 145.19 | 1944.91 |
| 8 | 148.55 | 1933.72 |
| 16 | 148.21 | 1961.74 |
| 32 | 144.99 | 1999.52 |
| 64 | 151.31 | 1993.13 |
| 128 | 159.65 | 1996.37 |
| 256 | 163.22 | 2047.14 |
| 512 | 175.57 | 1381.85 |
| 1024 | 129.66 | 1384.73 |
| 2048 | 129.04 | 1402.66 |
| 4096 | 132.9 | 1447.07 |
| 8192 | 287.9 | 1497.81 |
| 16384 | 297.13 | 1660.14 |
| 32768 | 226.47 | 1814.38 |
| 65536 | 395.63 | 2787.03 |
| 131072 | 753.19 | 3692.19 |
| 262144 | 1577.04 | 5530.24 |
| 524288 | 3615.98 | 8515.6 |
| 1048576 | 6751.7 | 14981.21 |
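For reference, the numbers above came from a run along these lines (the exact `osu_alltoall` path and `srun` flags are my assumptions based on the description, not copied from the actual job script):

```shell
# 16 nodes, 4 tasks per node, CUDA device buffers; osu_alltoall reports
# average latency in microseconds for each message size.
srun -N 16 --ntasks-per-node=4 ./osu_alltoall -d cuda
```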
I would like to know if there are any configure options that I am missing that can be used to improve the performance or any compile/runtime flags that can be optimized.
Here are the configure options I used when building mpich:

```shell
./configure --prefix=$install_path \
    --enable-fast=O2 \
    --with-pm=no \
    --with-xpmem=/$path_to_xpmem \
    --with-wrapper-dl-type=rpath \
    --enable-threads=multiple \
    --enable-shared=yes \
    --enable-static=no \
    --with-namepublisher=file \
    --with-libfabric=$path_to_libfabric \
    --with-device=ch4:ofi \
    --with-ch4-shmmods=posix,xpmem \
    --enable-thread-cs=per-vci \
    --with-cuda=$path_to_cuda \
    CPPFLAGS=-I$path_to_pmi \
    CC=gcc CFLAGS= CXX=g++ \
    FC=gfortran FCFLAGS=-fallow-argument-mismatch \
    F77=gfortran FFLAGS=-fallow-argument-mismatch \
    MPICHLIB_CFLAGS=-fPIC MPICHLIB_CXXFLAGS=-fPIC \
    MPICHLIB_FFLAGS=-fPIC MPICHLIB_FCFLAGS=-fPIC
```
I am using cuda/12.4, and I did not add any specific compile or runtime options when running the above tests.
Try setting `MPIR_CVAR_CH4_OFI_ENABLE_HMEM=1`.
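A minimal way to apply this, assuming an `srun` launcher (MPICH reads `MPIR_CVAR_*` settings from the environment, so no rebuild is needed; the node counts and benchmark flags below are taken from the run described above):

```shell
# Have the CH4/OFI netmod register GPU (HMEM) buffers with libfabric,
# so device memory can be sent over the network without extra staging.
export MPIR_CVAR_CH4_OFI_ENABLE_HMEM=1
srun -N 16 --ntasks-per-node=4 ./osu_alltoall -d cuda
```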
Thanks. That helped bridge the gap between cray-mpich and mpich from 512 bytes and above. Is there a protocol switch that happens at that size?
We don't, but probably we should. ~Is it ROCm on Perlmutter?~ Never mind. You said CUDA.