mpich Wrong data with MPI send/recv and pipelining on Intel GPUs

We're getting incorrect results in application code when using MPIR_CVAR_CH4_OFI_ENABLE_GPU_PIPELINE=1 if the buffer size MPIR_CVAR_CH4_OFI_GPU_PIPELINE_BUFFER_SZ isn't set large enough. Setting it larger seems to work, but MPI should still give correct results (with possible performance hit, or give an error) if it is not set large enough. The full code is fairly complicated, but I have a simple reproducer which can somewhat reproduce the issue. The reproducer can easily fail if the buffer size is set lower than the default, but it doesn't seem to fail for the default size on up to 8 nodes. With a buffer size of 512k it fails easily on 4 nodes, and with 256k will fail regularly on 2 nodes.

Reproducer

sendrecvgpu.cc

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <sycl/sycl.hpp>

//const int nmesg = 2;
const int nmesg = 16;
//const int nmesg = 24;
//const int nmesg = 32;
//const int nrep = 1;
const int nrep = 1000;
//const int nrep = 10000;
//const int nrep = 20000;
const int nmin = 128*1024;
//const int nmax = 128*1024;
//const int nmin = 256*1024;
const int nmax = 256*1024;
//const int nmin = 2*1024*1024;
//const int nmax = 2*1024*1024;

void sendrecv(double *dest[], double *src[], int n) {
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  MPI_Request sreq[nmesg], rreq[nmesg];
  for(int i=0; i<nmesg; i++) {
    int k = 1 << i;
    int recv = (rank+k) % size;
    MPI_Irecv(dest[i], n, MPI_DOUBLE, recv, i, MPI_COMM_WORLD, &rreq[i]);
  }
  for(int i=0; i<nmesg; i++) {
    int k = 1 << i;
    int send = (rank+k*size-k) % size;
    MPI_Isend(src[i], n, MPI_DOUBLE, send, i, MPI_COMM_WORLD, &sreq[i]);
  }
  MPI_Waitall(nmesg, sreq, MPI_STATUS_IGNORE);
  MPI_Waitall(nmesg, rreq, MPI_STATUS_IGNORE);
}

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  char name[MPI_MAX_PROCESSOR_NAME];
  int namelen;
  MPI_Get_processor_name(name, &namelen);
  //sycl::queue q{sycl::gpu_selector_v};
  sycl::platform plat{sycl::gpu_selector_v};
  auto devs = plat.get_devices();
  int ndev = devs.size();
  int devid = rank % ndev;
  printf("%s  rank %3i  device %2i\n", name, rank, devid);
  fflush(stdout);
  MPI_Barrier(MPI_COMM_WORLD);
  sycl::queue q{devs[devid]};
  double *src[nmesg], *srcg[nmesg], *dest[nmesg], *destg[nmesg];
  for(int i=0; i<nmesg; i++) {
    src[i] = (double*)malloc(nmax*sizeof(double));
    srcg[i] = (double*)sycl::malloc_device<double>(nmax, q);
    dest[i] = (double*)malloc(nmax*sizeof(double));
    destg[i] = (double*)sycl::malloc_device<double>(nmax, q);
#pragma omp parallel for
    for(int j=0; j<nmax; j++) {
      src[i][j] = i + j;
    }
  }

  int error = 0;
  int errort = 0;
  for(int n=nmin; n<=nmax; n*=2) {
    if(rank==0) printf("Testing n = %i ...", n);
    for(int rep=0; rep<nrep; rep++) {
      //sendrecv(dest, src, n);
      for(int i=0; i<nmesg; i++) {
	q.memcpy(srcg[i], src[i], n*sizeof(double));
	q.memset(destg[i], 0, n*sizeof(double));
      }
      q.wait();
      sendrecv(destg, srcg, n);
      for(int i=0; i<nmesg; i++) {
	q.memcpy(dest[i], destg[i], n*sizeof(double));
      }
      q.wait();
      for(int i=0; i<nmesg; i++) {
	for(int j=0; j<n; j++) {
	  if (dest[i][j] != src[i][j]) {
	    printf("\n  error %i dest[%i][%i] = %f expected %f\n", rep, i, j, dest[i][j], src[i][j]);
	    error++;
	    break;
	  }
	}
	if(error>0) break;
      }
      MPI_Allreduce(&error, &errort, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
      if (errort>0) break;
    }
    if(errort>0) {
      if (rank==0) printf(" %i errors.\n", errort);
      break;
    } else {
      if (rank==0) printf(" done.\n");
    }
  }
  MPI_Finalize();
}

mpicxx -fsycl -qopenmp sendrecvgpu.cc -o sendrecvgpu

export MPIR_CVAR_CH4_OFI_ENABLE_GPU_PIPELINE=1 export MPIR_CVAR_CH4_OFI_GPU_PIPELINE_BUFFER_SZ=$((256*1024)) export ZE_FLAT_DEVICE_HIERARCHY=FLAT mpiexec -np 24 --ppn 12 ./sendrecvgpu

Sep 12 '24 21:09 jcosborn

I can reproduce this on Aurora with commit d79cd238209c787bbcbbe730f9b958afe4e852ac ~b3480ddfec1d9e98b06783aec97c082eadeca1a7~ (updating with test from newer commit) from main as well.

Sep 13 '24 18:09 colleeneb

Thanks for the reproducer. it appears in GPU pipelining, there is potentially scenarios that chunks are written into receive buffers out-of-order. I created a PR https://github.com/pmodels/mpich/pull/7182 to fix it.

Oct 19 '24 15:10 zhenggb72

I confirmed that the reproducer passes for module load mpich/opt/develop-git.204f8cd on Aurora (which includes PR #7182 ). @jcosborn if you have a chance to test out this module, it would be appreciated!

Nov 01 '24 14:11 colleeneb

I also confirmed this fixes the reproducer, however I now get hangs for some specific cases when running a full application with pipelining when not setting a larger buffer size. The cases seem to involve messages of different sizes, where some messages are much larger than the rest. I don't know the exact requirements yet and don't have a simple reproducer, but will keep trying to see if I can get one.

Nov 14 '24 18:11 jcosborn

I'm now getting hangs when running this test case with the newly compiled MPICH currently available in the alcf_kmd_val Aurora queue.

Jan 15 '25 19:01 jcosborn

I was able to reproduce the hang at 2 nodes with 1 process per node: MPIR_CVAR_CH4_OFI_ENABLE_GPU_PIPELINE=1 MPIR_CVAR_CH4_OFI_GPU_PIPELINE_BUFFER_SZ=$((256*1024)) ZE_FLAT_DEVICE_HIERARCHY=FLAT mpiexec -np 2 --ppn 1 ./sendrecvgpu

The backtrace of the two ranks is: rank 0:

(gdb) bt
#0  0x0000154b07fe0210 in ofi_mutex_unlock_noop () at src/common.c:988
#1  0x0000154b07ff5349 in ofi_genlock_unlock (lock=0x51984f0) at ./include/ofi_lock.h:394
#2  ofi_cq_read_entries (src_addr=0x0, count=<optimized out>, buf=<optimized out>, cq=0x5198460) at ./include/ofi_util.h:615
#3  ofi_cq_readfrom (cq_fid=0x5198460, buf=<optimized out>, count=<optimized out>, src_addr=0x0) at prov/util/src/util_cq.c:272
#4  0x0000154b2a5f620b in MPIDI_NM_progress ()
   from /opt/aurora/24.180.3/spack/unified/0.8.0/install/linux-sles15-x86_64/oneapi-2024.07.30.002/mpich-4.3.0rc2-yuj3fkn/lib/libmpi.so.0
#5  0x0000154b2a5f4f76 in MPIDI_progress_test ()
   from /opt/aurora/24.180.3/spack/unified/0.8.0/install/linux-sles15-x86_64/oneapi-2024.07.30.002/mpich-4.3.0rc2-yuj3fkn/lib/libmpi.so.0
#6  0x0000154b2a5f23e5 in MPIR_Waitall_state ()
   from /opt/aurora/24.180.3/spack/unified/0.8.0/install/linux-sles15-x86_64/oneapi-2024.07.30.002/mpich-4.3.0rc2-yuj3fkn/lib/libmpi.so.0
#7  0x0000154b2a5f2ba9 in MPIR_Waitall ()
   from /opt/aurora/24.180.3/spack/unified/0.8.0/install/linux-sles15-x86_64/oneapi-2024.07.30.002/mpich-4.3.0rc2-yuj3fkn/lib/libmpi.so.0
#8  0x0000154b2a44f9b5 in PMPI_Waitall ()
   from /opt/aurora/24.180.3/spack/unified/0.8.0/install/linux-sles15-x86_64/oneapi-2024.07.30.002/mpich-4.3.0rc2-yuj3fkn/lib/libmpi.so.0
#9  0x0000000000403030 in sendrecv (dest=0x7ffecd6fee70, src=0x7ffecd6fef70, n=131072) at t.cpp:36
#10 0x0000000000403743 in main (argc=1, argv=0x7ffecd6ff288) at t.cpp:80

rank 1:

#0  0x000014c273136fa0 in MPIDI_POSIX_eager_recv_begin ()
   from /opt/aurora/24.180.3/spack/unified/0.8.0/install/linux-sles15-x86_64/oneapi-2024.07.30.002/mpich-4.3.0rc2-yuj3fkn/lib/libmpi.so.0
#1  0x000014c2730783af in MPIDI_SHM_progress ()
   from /opt/aurora/24.180.3/spack/unified/0.8.0/install/linux-sles15-x86_64/oneapi-2024.07.30.002/mpich-4.3.0rc2-yuj3fkn/lib/libmpi.so.0
#2  0x000014c273077e8a in MPIDI_progress_test ()
   from /opt/aurora/24.180.3/spack/unified/0.8.0/install/linux-sles15-x86_64/oneapi-2024.07.30.002/mpich-4.3.0rc2-yuj3fkn/lib/libmpi.so.0
#3  0x000014c2730753e5 in MPIR_Waitall_state ()
   from /opt/aurora/24.180.3/spack/unified/0.8.0/install/linux-sles15-x86_64/oneapi-2024.07.30.002/mpich-4.3.0rc2-yuj3fkn/lib/libmpi.so.0
#4  0x000014c273075ba9 in MPIR_Waitall ()
   from /opt/aurora/24.180.3/spack/unified/0.8.0/install/linux-sles15-x86_64/oneapi-2024.07.30.002/mpich-4.3.0rc2-yuj3fkn/lib/libmpi.so.0
#5  0x000014c272ed29b5 in PMPI_Waitall ()
   from /opt/aurora/24.180.3/spack/unified/0.8.0/install/linux-sles15-x86_64/oneapi-2024.07.30.002/mpich-4.3.0rc2-yuj3fkn/lib/libmpi.so.0
#6  0x0000000000403030 in sendrecv (dest=0x7ffda14eea60, src=0x7ffda14eeb60, n=131072) at t.cpp:36
#7  0x0000000000403743 in main (argc=1, argv=0x7ffda14eee78) at t.cpp:80

I guess they are both in the Waitall waiting for something.

Jan 15 '25 21:01 colleeneb

I can reproduce the hang with the main branch, however our latest drop version works. Could you try the drop version that was installed on Aurora?

Feb 04 '25 17:02 zhenggb72

Yes, it works fine with the current software stack on Aurora, thanks. I'm still seeing another hang with pipelining that is dependent on the relative message sizes. I don't have a simple reproducer, but can reproduce it with an application test case. Here's a build script for the test:

#!/bin/bash
ml cmake

git clone -b feature/sycl https://github.com/lattice/quda
mkdir build && cd build

export QUDA_TARGET=SYCL
export CC=mpicc
export CXX=mpicxx
export QUDA_SYCL_TARGETS="intel_gpu_pvc"
export SYCL_LINK_FLAGS="$SYCL_LINK_FLAGS -fsycl-device-code-split=per_kernel"
export SYCL_LINK_FLAGS="$SYCL_LINK_FLAGS -fsycl-max-parallel-link-jobs=32"
export SYCL_LINK_FLAGS="$SYCL_LINK_FLAGS -flink-huge-device-code"

o="$o -DCMAKE_BUILD_TYPE=RELEASE"
o="$o -DQUDA_DIRAC_DEFAULT_OFF=ON"
o="$o -DQUDA_DIRAC_STAGGERED=ON"
o="$o -DQUDA_FAST_COMPILE_REDUCE=ON"
o="$o -DQUDA_FAST_COMPILE_DSLASH=ON"
o="$o -DQUDA_MPI=ON"
o="$o -DMPIEXEC_EXECUTABLE=`which mpiexec`"

cmake $o ../quda

make -O -j16 staggered_invert_test |& tee build.log

and the run script

#!/bin/bash
#PBS -l select=2
#PBS -l walltime=1:00:00
#PBS -A Catalyst
#PBS -q debug

hostname
if [ ! -z "$PBS_O_WORKDIR" ]; then
    cd $PBS_O_WORKDIR
fi
source ~/bin/setup-oneapi
module -t --redirect list |sort

export QUDA_ENABLE_TUNING=0
export QUDA_ENABLE_P2P=0
export QUDA_ENABLE_GDR=1
export ZE_FLAT_DEVICE_HIERARCHY=FLAT
export MPIR_CVAR_CH4_OFI_ENABLE_GPU_PIPELINE=1
#export MPIR_CVAR_CH4_OFI_GPU_PIPELINE_BUFFER_SZ=$((8*1024*1024))

asq="--dslash-type asqtad --compute-fat-long false"
inv="--solve-type direct-pc --solution-type mat-pc --inv-type cg --matpc even-even"
par="--prec double --tol 1e-4 --mass 0.04 --niter 1000 --nsrc 3 --multishift 14"
geom="--dim 32 24 24 24 --gridsize 2 2 2 3 --rank-order row"
#geom="--dim 32 24 24 24 --gridsize 2 2 2 3"

mpiexec -np 24 --ppn 12 build/tests/staggered_invert_test $asq $inv $par $geom

As-is this will hang. Uncommenting the buffer size line, or swapping the 'geom' with the commented out one will make it run to completion. The "row" rank-order makes the messages passed between nodes be the same size as the larger of the messages passed within a node (this is the case that hangs). For the default row-order, the messages between nodes are the same size as the smaller of the messages within a node and it doesn't hang.

Feb 05 '25 02:02 jcosborn

@jcosborn please try the Intel provided drop: module load mpich/opt/4.2.3-intel

It seems to run with that version. I don't know what has changed in the default build.

Feb 05 '25 17:02 zhenggb72

Yes, it works with mpich/opt/4.2.3-intel.

Feb 05 '25 18:02 jcosborn

Just ran the reproducer on Aurora debug queue using fresh main branch -

39a60e165 02/12 11:31 Merge pull request #7281 from hzhou/2501_romio_fort

seems working:

sh test.sh
+ export MPIR_CVAR_CH4_OFI_ENABLE_GPU_PIPELINE=1
+ MPIR_CVAR_CH4_OFI_ENABLE_GPU_PIPELINE=1
+ export MPIR_CVAR_CH4_OFI_GPU_PIPELINE_BUFFER_SZ=262144
+ MPIR_CVAR_CH4_OFI_GPU_PIPELINE_BUFFER_SZ=262144
+ export ZE_FLAT_DEVICE_HIERARCHY=FLAT
+ ZE_FLAT_DEVICE_HIERARCHY=FLAT
+ mpichversion
MPICH Version:      5.0.0a1
MPICH Release date: unreleased development copy
MPICH ABI:          0:0:0
MPICH Device:       ch4:ofi
MPICH configure:    --with-pm=no --with-device=ch4:ofi --with-libfabric=/opt/cray/libfabric/1.20.1 --with-ze=/usr --disable-fortran --disable-romio --with-pmix=/usr
MPICH CC:           icx     -O2
MPICH CXX:          icpx   -O2
MPICH F77:          ifx
MPICH FC:           ifx
MPICH features:     threadcomm

+ mpirun -l -n 24 -ppn 12 ./sendrecvgpu
x4518c2s6b0n0.hostmgmt2518.cm.aurora.alcf.anl.gov 10: x4518c2s6b0n0  rank  10  device 10
x4518c2s6b0n0.hostmgmt2518.cm.aurora.alcf.anl.gov 0: x4518c2s6b0n0  rank   0  device  0
x4518c3s5b0n0.hostmgmt2518.cm.aurora.alcf.anl.gov 22: x4518c3s5b0n0  rank  22  device 10
x4518c2s6b0n0.hostmgmt2518.cm.aurora.alcf.anl.gov 1: x4518c2s6b0n0  rank   1  device  1
x4518c2s6b0n0.hostmgmt2518.cm.aurora.alcf.anl.gov 2: x4518c2s6b0n0  rank   2  device  2
x4518c3s5b0n0.hostmgmt2518.cm.aurora.alcf.anl.gov 23: x4518c3s5b0n0  rank  23  device 11
x4518c2s6b0n0.hostmgmt2518.cm.aurora.alcf.anl.gov 3: x4518c2s6b0n0  rank   3  device  3
x4518c3s5b0n0.hostmgmt2518.cm.aurora.alcf.anl.gov 12: x4518c3s5b0n0  rank  12  device  0
x4518c2s6b0n0.hostmgmt2518.cm.aurora.alcf.anl.gov 4: x4518c2s6b0n0  rank   4  device  4
x4518c2s6b0n0.hostmgmt2518.cm.aurora.alcf.anl.gov 5: x4518c2s6b0n0  rank   5  device  5
x4518c2s6b0n0.hostmgmt2518.cm.aurora.alcf.anl.gov 6: x4518c2s6b0n0  rank   6  device  6
x4518c2s6b0n0.hostmgmt2518.cm.aurora.alcf.anl.gov 7: x4518c2s6b0n0  rank   7  device  7
x4518c3s5b0n0.hostmgmt2518.cm.aurora.alcf.anl.gov 13: x4518c3s5b0n0  rank  13  device  1
x4518c2s6b0n0.hostmgmt2518.cm.aurora.alcf.anl.gov 8: x4518c2s6b0n0  rank   8  device  8
x4518c2s6b0n0.hostmgmt2518.cm.aurora.alcf.anl.gov 9: x4518c2s6b0n0  rank   9  device  9
x4518c2s6b0n0.hostmgmt2518.cm.aurora.alcf.anl.gov 11: x4518c2s6b0n0  rank  11  device 11
x4518c3s5b0n0.hostmgmt2518.cm.aurora.alcf.anl.gov 14: x4518c3s5b0n0  rank  14  device  2
x4518c3s5b0n0.hostmgmt2518.cm.aurora.alcf.anl.gov 15: x4518c3s5b0n0  rank  15  device  3
x4518c3s5b0n0.hostmgmt2518.cm.aurora.alcf.anl.gov 16: x4518c3s5b0n0  rank  16  device  4
x4518c3s5b0n0.hostmgmt2518.cm.aurora.alcf.anl.gov 17: x4518c3s5b0n0  rank  17  device  5
x4518c3s5b0n0.hostmgmt2518.cm.aurora.alcf.anl.gov 18: x4518c3s5b0n0  rank  18  device  6
x4518c3s5b0n0.hostmgmt2518.cm.aurora.alcf.anl.gov 19: x4518c3s5b0n0  rank  19  device  7
x4518c3s5b0n0.hostmgmt2518.cm.aurora.alcf.anl.gov 20: x4518c3s5b0n0  rank  20  device  8
x4518c3s5b0n0.hostmgmt2518.cm.aurora.alcf.anl.gov 21: x4518c3s5b0n0  rank  21  device  9
x4518c2s6b0n0.hostmgmt2518.cm.aurora.alcf.anl.gov 0: Testing n = 131072 ... done.
Testing n = 262144 ... done.

real    0m47.292s
user    0m0.005s
sys     0m0.006s

I'll try https://github.com/pmodels/mpich/issues/7139#issuecomment-2593954162 next...

hzhou@x4518c2s6b0n0:~/temp/debug_pipeline> MPIR_CVAR_CH4_OFI_ENABLE_GPU_PIPELINE=1 MPIR_CVAR_CH4_OFI_GPU_PIPELINE_BUFFER_SZ=$((
256*1024)) ZE_FLAT_DEVICE_HIERARCHY=FLAT time mpiexec -np 2 --ppn 1 ./sendrecvgpu
x4518c2s6b0n0  rank   0  device  0
x4518c3s5b0n0  rank   1  device  1
Testing n = 131072 ... done.
Testing n = 262144 ... done.
0.00user 0.00system 0:49.17elapsed 0%CPU (0avgtext+0avgdata 6704maxresident)k
0inputs+0outputs (0major+405minor)pagefaults 0swaps

Feb 12 '25 18:02 hzhou

Can I query what is the consensus view of the status of this issue?

Is there a known good MPI installed in some module on Aurora?

It seems the original correctness issue was fixed in some version(s), but that a second issue was raised terms of QUDA hangs? Is that second issue under study, or is it viewed as fixed?

In the meantime, I've implemented an alternate route in Grid to give good performance assuming pipelining can't be relied on, and use only Host-Host MPI with Sycl D2D intranode and explicit D2H / H2D copies.

Mar 11 '25 17:03 paboyle

The reproducer using QUDA given in the comment https://github.com/pmodels/mpich/issues/7139#issuecomment-2635541714 is still hanging on Aurora. This is now also being tracked in the Aurora issues tracker https://github.com/argonne-lcf/AuroraBugTracking/issues/17 . The issue with the original reproducer was fixed, so maybe that is being confused with this second issue. Should the remaining issue (the hang) be filed as a new issue?

Apr 08 '25 17:04 jcosborn

Yes, filing as new issue will help clarifying the status.

Apr 08 '25 17:04 hzhou

Ok, the remaining issue is filed in https://github.com/pmodels/mpich/issues/7373. Since the original issue was fixed, I'm closing this one.

Apr 08 '25 17:04 jcosborn