
Hang with MPI send/recv and pipelining on Intel GPUs

Open jcosborn opened this issue 11 months ago • 8 comments

An application is hanging on Aurora with pipelining on. The hang is dependent on the relative message sizes being sent between nodes and within a node. This issue was mentioned in https://github.com/pmodels/mpich/issues/7139#issuecomment-2635541714 and is now being separated into its own issue. It is also mentioned on the Aurora bug tracker https://github.com/argonne-lcf/AuroraBugTracking/issues/17. Here's a build script for a reproducer:

#!/bin/bash
ml cmake

git clone -b feature/sycl https://github.com/lattice/quda
mkdir build && cd build

export QUDA_TARGET=SYCL
export CC=mpicc
export CXX=mpicxx
export QUDA_SYCL_TARGETS="intel_gpu_pvc"
export SYCL_LINK_FLAGS="$SYCL_LINK_FLAGS -fsycl-device-code-split=per_kernel"
export SYCL_LINK_FLAGS="$SYCL_LINK_FLAGS -fsycl-max-parallel-link-jobs=32"
export SYCL_LINK_FLAGS="$SYCL_LINK_FLAGS -flink-huge-device-code"

o="$o -DCMAKE_BUILD_TYPE=RELEASE"
o="$o -DQUDA_DIRAC_DEFAULT_OFF=ON"
o="$o -DQUDA_DIRAC_STAGGERED=ON"
o="$o -DQUDA_FAST_COMPILE_REDUCE=ON"
o="$o -DQUDA_FAST_COMPILE_DSLASH=ON"
o="$o -DQUDA_MPI=ON"
o="$o -DMPIEXEC_EXECUTABLE=`which mpiexec`"

cmake $o ../quda

make -O -j16 staggered_invert_test |& tee build.log

and the run script

#!/bin/bash
#PBS -l select=2
#PBS -l walltime=1:00:00
#PBS -l filesystems=home
#PBS -A Catalyst
#PBS -q debug

hostname
if [ ! -z "$PBS_O_WORKDIR" ]; then
    cd $PBS_O_WORKDIR
fi
module -t --redirect list |sort

export QUDA_ENABLE_TUNING=0
export QUDA_ENABLE_P2P=0
export QUDA_ENABLE_GDR=1
export ZE_FLAT_DEVICE_HIERARCHY=FLAT
export MPIR_CVAR_CH4_OFI_ENABLE_GPU_PIPELINE=1
#export MPIR_CVAR_CH4_OFI_GPU_PIPELINE_BUFFER_SZ=$((8*1024*1024))

asq="--dslash-type asqtad --compute-fat-long false"
inv="--solve-type direct-pc --solution-type mat-pc --inv-type cg --matpc even-even"
par="--prec double --tol 1e-4 --mass 0.04 --niter 1000 --nsrc 3 --multishift 14"
geom="--dim 32 24 24 24 --gridsize 2 2 2 3 --rank-order row"
#geom="--dim 32 24 24 24 --gridsize 2 2 2 3"

mpiexec -np 24 --ppn 12 build/tests/staggered_invert_test $asq $inv $par $geom

As-is, this will hang. Uncommenting the buffer-size line, or swapping the 'geom' with the commented-out one, makes it run to completion. The "row" rank order makes the messages passed between nodes the same size as the larger of the messages passed within a node (this is the case that hangs). With the default rank order, the messages between nodes are the same size as the smaller of the messages within a node, and it doesn't hang.
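For concreteness, the face sizes implied by this geometry can be sketched with a bit of arithmetic (a hypothetical illustration only; actual message byte counts also depend on precision and QUDA's field layout, and which face ends up between nodes depends on the rank order as described above):

```python
# Hypothetical arithmetic for the geometry in the run script.
dims = (32, 24, 24, 24)   # --dim 32 24 24 24
grid = (2, 2, 2, 3)       # --gridsize 2 2 2 3

# per-rank sublattice: 16 x 12 x 12 x 8
local = [d // g for d, g in zip(dims, grid)]

# surface (face) sites for each of the 4 dimensions
faces = []
for mu in range(4):
    sites = 1
    for nu in range(4):
        if nu != mu:
            sites *= local[nu]
    faces.append(sites)

print("local sublattice:", local)          # [16, 12, 12, 8]
print("face sites (x, y, z, t):", faces)   # [1152, 1536, 1536, 2304]
```

Per the description above, the "row" rank order puts the largest face between nodes, while the default order puts the smallest face between nodes.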

jcosborn avatar Apr 08 '25 17:04 jcosborn

I think I can reproduce -- did the output look like the following for you?

+ mpiexec -np 24 --ppn 12 build/tests/staggered_invert_test --dslash-type asqtad --compute-fat-long false --solve-type direct-pc --solution-type mat-pc --inv-type cg --matpc even-even --prec double --tol 1e-4 --mass 0.04 --niter 1000 --nsrc 3 --multishift 14 --dim 32 24 24 24 --gridsize 2 2 2 3 --rank-order row
Enabling GPU-Direct RDMA access
Disabling peer-to-peer access
Rank order is row major (x running fastest)
running the following test:
prec    prec_sloppy   multishift  matpc_type  recon  recon_sloppy solve_type S_dimension T_dimension Ls_dimension   dslash_type  normalization
double   double          14        even_even     18     18          direct_pc  32/ 24/ 24      24         16               asqtad     kappa
Grid partition info:     X  Y  Z  T
                         1  1  1  1
QUDA 1.1.0 (git 1.1.0-a699e2be2-SYCL)
SYCL platforms available:
  Intel(R) oneAPI Unified Runtime over Level-Zero Intel(R) Corporation 1.6
Selector score: 11 Intel(R) Data Center GPU Max 1550
Selector score: 11 Intel(R) Data Center GPU Max 1550
[...]
  Atomic memory orders: relaxed acquire release acq_rel seq_cst
  Atomic memory scopes: work_item sub_group work_group device system
WARNING: Data reordering done on GPU (set with QUDA_REORDER_LOCATION=GPU/CPU)
WARNING: Environment variable QUDA_RESOURCE_PATH is not set.
WARNING: Caching of tuned parameters will be disabled
Creating context... done
WARNING: Autotuning disabled
WARNING: Using device memory pool allocator
WARNING: Using pinned memory pool allocator

and then no more output?

colleeneb avatar Apr 11 '25 16:04 colleeneb

Double check whether the bug still persists with FI_HMEM turned on, i.e. MPICH_CVAR_CH4_OFI_HMEM_ENABLE=1

hzhou avatar Apr 23 '25 19:04 hzhou

It is still hanging with mpich/opt/develop-git.6037a7a (aurora_test branch) and MPICH_CVAR_CH4_OFI_HMEM_ENABLE=1. Is it worth trying main?

colleeneb avatar Apr 24 '25 23:04 colleeneb

> It is still hanging with mpich/opt/develop-git.6037a7a (aurora_test branch) and MPICH_CVAR_CH4_OFI_HMEM_ENABLE=1. Is it worth trying main?

Then this may not even be related to the pipelining path. Does the app hang with MPIR_CVAR_CH4_OFI_ENABLE_GPU_PIPELINE=0?

hzhou avatar Apr 24 '25 23:04 hzhou

I can confirm it still hangs with MPICH_CVAR_CH4_OFI_HMEM_ENABLE=1 on the Aurora default MPICH. With MPIR_CVAR_CH4_OFI_ENABLE_GPU_PIPELINE=0 it doesn't hang.

jcosborn avatar Apr 25 '25 16:04 jcosborn

Thanks @jcosborn for the confirmation. Is the app using non-contiguous datatypes? The non-contig path was not enabled for pipelining in the early versions. We enabled the path since we didn't see why the non-contig types wouldn't work.

To confirm - is the difference between rank order options just message sizes between nodes or does it also introduce non-contiguous datatypes?

hzhou avatar Apr 25 '25 16:04 hzhou

> Uncommenting the buffer size line, or swapping the 'geom' with the commented out one will make it run to completion.

That makes the pipelining chunks bigger, resulting in fewer outstanding chunks in flight. I wonder whether the hang is related to network stress.
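The effect of MPIR_CVAR_CH4_OFI_GPU_PIPELINE_BUFFER_SZ on the number of outstanding chunks is simple ceiling-division arithmetic. The 8 MiB value is the one commented out in the run script; the 16 MiB message and the 1 MiB smaller buffer are hypothetical numbers for illustration:

```python
# Hypothetical illustration: a bigger pipeline buffer means fewer
# chunks in flight for the same message size.
def num_chunks(msg_bytes, buf_bytes):
    return -(-msg_bytes // buf_bytes)   # ceiling division

msg = 16 * 1024 * 1024                    # a hypothetical 16 MiB message
print(num_chunks(msg, 1 * 1024 * 1024))   # 16 chunks with a 1 MiB buffer
print(num_chunks(msg, 8 * 1024 * 1024))   # 2 chunks with the 8 MiB buffer
```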

@jcosborn In case those observations (changing the buffer size makes it run) were from older tests, could you reconfirm them?

hzhou avatar Apr 25 '25 17:04 hzhou

I think it is only using contiguous buffers. Yes, increasing the buffer size avoids the hang.

jcosborn avatar Apr 25 '25 17:04 jcosborn

The original pipeline algorithm will be replaced in https://github.com/pmodels/mpich/pull/7529.

hzhou avatar Aug 11 '25 15:08 hzhou

I tested this on sunspot with aurora_test branch commit 4fda512 (just the default module mpich/opt/5.0.0.git.4fda512 there). It looks like the reproducer still hung. When I added export MPIR_CVAR_CH4_OFI_EAGER_THRESHOLD=1000000 on a recommendation, it crashed with:

ERROR: (MPI) Other MPI error (rank 4, host x1921c0s4b0n0, communicator_mpi.cpp:387 in void quda::Communicator::comm_wait(MsgHandle *)())
       last kernel called was (name=N4quda12ExtractGhostIdLi3ENS_5gauge11FloatNOrderIdLi18ELi2ELi18EL20QudaStaggeredPhase_s0ELb1EL19QudaGhostExchange_sn2147483648ELb0EEEEE,volume=32x24x24x24,aux=GPU-offline,kernel_arg_threshold=2040,vol=442368,stride=230400,precision=8,geometry=4,Nc=3,extract)

I'm not sure if that message helps anything. But I will test again once https://github.com/pmodels/mpich/pull/7529 is merged too.

colleeneb avatar Aug 20 '25 20:08 colleeneb

@colleeneb MPICH_CVAR_CH4_OFI_HMEM_ENABLE is off, right?

hzhou avatar Aug 20 '25 21:08 hzhou

Yes, it is off (i.e. MPICH_CVAR_CH4_OFI_HMEM_ENABLE is unset)

colleeneb avatar Aug 21 '25 19:08 colleeneb

Can we get the long form of the error message, e.g the MPICH error stack?

hzhou avatar Aug 21 '25 19:08 hzhou

Yes, I will try with debug mpich in a bit and report back.

colleeneb avatar Aug 21 '25 19:08 colleeneb

It appears the user may be setting MPI_ERRORS_RETURN as the MPI error handler. We should see if the app can leave the default fatal error handler, or if we can override the user setting somehow in MPICH to get the full error stack.

raffenet avatar Aug 27 '25 19:08 raffenet

Some updates from running on sunspot today:

  • Sunspot with mpich/opt/5.0.0.git.4fda512: Same as above, it crashes with the
ERROR: (MPI) Other MPI error (rank 19, host x1922c2s2b0n0, communicator_mpi.cpp:387 in void quda::Communicator::comm_wait(MsgHandle *)())
       last kernel called was (name=N4quda12ExtractGhostIdLi3ENS_5gauge11FloatNOrderIdLi18ELi2ELi18EL20QudaStaggeredPhase_s0ELb1EL19QudaGhostExchange_sn2147483648ELb0EEEEE,volume=32x24x24x24,aux=GPU-offline,kernel_arg_threshold=2040,vol=442368,stride=230400,precision=8,geometry=4,Nc=3,extract)
  • Sunspot with mpich/opt/5.0.0.git.4fda512 and unset MPIR_CVAR_CH4_OFI_EAGER_THRESHOLD: Runs to completion as far as I can tell! James, can you confirm? There's nothing in the error file and the output file ends with:
Device memory used = 2392.2 MB
Pinned device memory used = 0.0 MB
Managed memory used = 0.0 MB
Page-locked host memory used = 937.2 MB
Total host memory used >= 2652.1 MB
  • Sunspot with mpich/dbg/5.0.0.git.4fda512: I switched to this to try to get the debug stack trace as discussed, but the build hangs. I'm not sure why yet. I know there are issues with the latest SDK for debug builds of some apps (super slow or crashing), but I'm surprised that just switching to debug mpich would cause this.

So in conclusion, right now it looks like it doesn't hang if MPIR_CVAR_CH4_OFI_EAGER_THRESHOLD is unset -- I'm not familiar enough with that variable to know why it would have an effect. And compiling with mpich/dbg causes a very slow QUDA build, which we'll need to look into on the compiler side.

colleeneb avatar Aug 27 '25 22:08 colleeneb

I got the error:

Abort(894120335) on node 23 (rank 23 in comm 496): Fatal error in internal_Wait: Other MPI error, error stack:
internal_Wait(72040)..........: MPI_Wait(request=0x8aed160, status=0x1) failed
MPIR_Wait(741)................:
MPIR_Wait_state(698)..........:
MPIDI_progress_test(171)......:
MPIDI_NM_progress(108)........:
MPIDI_OFI_handle_cq_error(550): OFI poll failed (default nic=cxi3: Truncation error)
x4302c3s7b0n0.hsn.cm.aurora.alcf.anl.gov: rank 19 exited with code 143

hzhou avatar Oct 07 '25 19:10 hzhou

As usual, once I added debug printfs, the bug hides away :(

hzhou avatar Oct 07 '25 21:10 hzhou

Now I couldn't get the error even without the printf :(

hzhou avatar Oct 07 '25 22:10 hzhou

Okay, triggers the error 30% of the time. I think I got some clues now.

hzhou avatar Oct 07 '25 22:10 hzhou

Thanks a lot!

colleeneb avatar Oct 07 '25 22:10 colleeneb

This is currently how the pipeline works:

Sender:

  1. while (chunks_remain) issue_chunk_async_copy
  2. for-any chunk async_copy done -> issue chunk send

Receiver:

  1. while (chunks_remain) issue chunk receive
  2. for-any received chunk -> issue async copy

The sender's step 2 ("for-any") may complete chunks out of order: for example, if both chunk 1's and chunk 2's copies complete after progress has already checked chunk 1 but before it checks chunk 2, chunk 2 is sent before chunk 1. This out-of-order send mismatches the corresponding chunk receives.

Solution: if chunk 1's copy is not complete, skip checking the later chunks. This ensures the chunk sends are issued in order.
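A minimal model of the buggy vs. fixed sender progress loop (a sketch, not the actual MPICH code; the names are hypothetical):

```python
# Each chunk's async copy may complete at any time; the sender must still
# issue the chunk *sends* in chunk order so they match the receiver's
# posted chunk receives.
def progress_sends(copy_done, sent, in_order):
    """Issue sends for chunks whose device copies have completed.

    copy_done[i] is True once chunk i's async copy finished.
    With in_order=False any completed chunk is sent immediately (the
    buggy behavior); with in_order=True progress stops at the first
    incomplete chunk (the fix).
    """
    issued = []
    for i, done in enumerate(copy_done):
        if sent[i]:
            continue
        if not done:
            if in_order:
                break        # the fix: don't send later chunks yet
            continue         # buggy: skip ahead and send out of order
        sent[i] = True
        issued.append(i)
    return issued

# Chunk 0's copy is still pending, but chunks 1 and 2 finished first.
copy_done = [False, True, True]
print(progress_sends(copy_done, [False] * 3, in_order=False))  # [1, 2] -- out of order
print(progress_sends(copy_done, [False] * 3, in_order=True))   # [] -- waits for chunk 0
```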

PR is coming.

hzhou avatar Oct 08 '25 02:10 hzhou