Hang with MPI send/recv and pipelining on Intel GPUs
An application is hanging on Aurora with GPU pipelining enabled. The hang depends on the relative sizes of the messages sent between nodes and within a node. This issue was mentioned in https://github.com/pmodels/mpich/issues/7139#issuecomment-2635541714 and is now being separated into its own issue. It is also mentioned on the Aurora bug tracker at https://github.com/argonne-lcf/AuroraBugTracking/issues/17. Here's a build script for a reproducer:
#!/bin/bash
ml cmake
git clone -b feature/sycl https://github.com/lattice/quda
mkdir build && cd build
export QUDA_TARGET=SYCL
export CC=mpicc
export CXX=mpicxx
export QUDA_SYCL_TARGETS="intel_gpu_pvc"
export SYCL_LINK_FLAGS="$SYCL_LINK_FLAGS -fsycl-device-code-split=per_kernel"
export SYCL_LINK_FLAGS="$SYCL_LINK_FLAGS -fsycl-max-parallel-link-jobs=32"
export SYCL_LINK_FLAGS="$SYCL_LINK_FLAGS -flink-huge-device-code"
o="$o -DCMAKE_BUILD_TYPE=RELEASE"
o="$o -DQUDA_DIRAC_DEFAULT_OFF=ON"
o="$o -DQUDA_DIRAC_STAGGERED=ON"
o="$o -DQUDA_FAST_COMPILE_REDUCE=ON"
o="$o -DQUDA_FAST_COMPILE_DSLASH=ON"
o="$o -DQUDA_MPI=ON"
o="$o -DMPIEXEC_EXECUTABLE=`which mpiexec`"
cmake $o ../quda
make -O -j16 staggered_invert_test |& tee build.log
and the run script
#!/bin/bash
#PBS -l select=2
#PBS -l walltime=1:00:00
#PBS -l filesystems=home
#PBS -A Catalyst
#PBS -q debug
hostname
if [ ! -z "$PBS_O_WORKDIR" ]; then
cd $PBS_O_WORKDIR
fi
module -t --redirect list |sort
export QUDA_ENABLE_TUNING=0
export QUDA_ENABLE_P2P=0
export QUDA_ENABLE_GDR=1
export ZE_FLAT_DEVICE_HIERARCHY=FLAT
export MPIR_CVAR_CH4_OFI_ENABLE_GPU_PIPELINE=1
#export MPIR_CVAR_CH4_OFI_GPU_PIPELINE_BUFFER_SZ=$((8*1024*1024))
asq="--dslash-type asqtad --compute-fat-long false"
inv="--solve-type direct-pc --solution-type mat-pc --inv-type cg --matpc even-even"
par="--prec double --tol 1e-4 --mass 0.04 --niter 1000 --nsrc 3 --multishift 14"
geom="--dim 32 24 24 24 --gridsize 2 2 2 3 --rank-order row"
#geom="--dim 32 24 24 24 --gridsize 2 2 2 3"
mpiexec -np 24 --ppn 12 build/tests/staggered_invert_test $asq $inv $par $geom
As-is this will hang. Uncommenting the buffer size line, or swapping 'geom' with the commented-out one, will make it run to completion. The "row" rank order makes the messages passed between nodes the same size as the larger of the messages passed within a node (this is the case that hangs). For the default rank order, the messages between nodes are the same size as the smaller of the messages within a node, and it doesn't hang.
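For reference, the relative message sizes can be worked out from the local lattice geometry. A quick sketch (my own arithmetic, not taken from the app; which face crosses the node boundary depends on the rank order) of the per-dimension halo surface sizes for --dim 32 24 24 24 --gridsize 2 2 2 3:

```python
# Halo surface site counts for a 4D lattice split across ranks.
# Assumption: each face message is proportional to the product of the
# other three local dimensions (standard halo-exchange pattern).
def surface_sites(dims, grid):
    local = [d // g for d, g in zip(dims, grid)]  # per-rank local dims
    volume = 1
    for x in local:
        volume *= x
    # surface in dimension i = local volume / local extent in i
    return [volume // local[i] for i in range(4)]

dims = [32, 24, 24, 24]
grid = [2, 2, 2, 3]  # 24 ranks total, 12 per node
print(surface_sites(dims, grid))  # [1152, 1536, 1536, 2304]
```

So the faces differ by up to a factor of two in site count, which is what makes the inter-node message either the larger or the smaller of the intra-node ones depending on how ranks are laid out.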
I think I can reproduce -- did the output look like the following for you?
+ mpiexec -np 24 --ppn 12 build/tests/staggered_invert_test --dslash-type asqtad --compute-fat-long false --solve-type direct-pc --solution-type mat-pc --inv-type cg --matpc even-even --prec double --tol 1e-4 --mass 0.04 --niter 1000 --nsrc 3 --multishift 14 --dim 32 24 24 24 --gridsize 2 2 2 3 --rank-order row
Enabling GPU-Direct RDMA access
Disabling peer-to-peer access
Rank order is row major (x running fastest)
running the following test:
prec prec_sloppy multishift matpc_type recon recon_sloppy solve_type S_dimension T_dimension Ls_dimension dslash_type normalization
double double 14 even_even 18 18 direct_pc 32/ 24/ 24 24 16 asqtad kappa
Grid partition info: X Y Z T
1 1 1 1
QUDA 1.1.0 (git 1.1.0-a699e2be2-SYCL)
SYCL platforms available:
Intel(R) oneAPI Unified Runtime over Level-Zero Intel(R) Corporation 1.6
Selector score: 11 Intel(R) Data Center GPU Max 1550
Selector score: 11 Intel(R) Data Center GPU Max 1550
[...]
Atomic memory orders: relaxed acquire release acq_rel seq_cst
Atomic memory scopes: work_item sub_group work_group device system
WARNING: Data reordering done on GPU (set with QUDA_REORDER_LOCATION=GPU/CPU)
WARNING: Environment variable QUDA_RESOURCE_PATH is not set.
WARNING: Caching of tuned parameters will be disabled
Creating context... done
WARNING: Autotuning disabled
WARNING: Using device memory pool allocator
WARNING: Using pinned memory pool allocator
and then no more output?
Double-check whether the bug still persists with FI_HMEM turned on, i.e. MPICH_CVAR_CH4_OFI_HMEM_ENABLE=1.
It is still hanging with mpich/opt/develop-git.6037a7a (aurora_test branch) and MPICH_CVAR_CH4_OFI_HMEM_ENABLE=1. Is it worth trying main?
Then this may not even be related to the pipelining path. Does the app hang with MPIR_CVAR_CH4_OFI_ENABLE_GPU_PIPELINE=0?
I can confirm it still hangs with MPICH_CVAR_CH4_OFI_HMEM_ENABLE=1 with the Aurora default.
With MPIR_CVAR_CH4_OFI_ENABLE_GPU_PIPELINE=0 it doesn't hang.
Thanks @jcosborn for the confirmation. Is the app using non-contiguous datatypes? The non-contig path was not enabled for pipelining in the early versions. We enabled the path since we didn't see why the non-contig types wouldn't work.
To confirm - is the difference between rank order options just message sizes between nodes or does it also introduce non-contiguous datatypes?
Uncommenting the buffer size line, or swapping the 'geom' with the commented out one will make it run to completion.
That makes the pipelining chunks bigger, resulting in fewer outstanding chunks in flight. I wonder whether the hang is related to network stress.
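For illustration, the number of chunks in flight scales inversely with the pipeline buffer size (the message size below is made up for the example, not taken from the app):

```python
import math

def num_chunks(message_bytes, buffer_sz):
    """Number of pipeline chunks a message is split into."""
    return math.ceil(message_bytes / buffer_sz)

msg = 2 * 1024 * 1024                     # hypothetical 2 MiB halo message
print(num_chunks(msg, 1 * 1024 * 1024))   # 2 chunks
print(num_chunks(msg, 8 * 1024 * 1024))   # 1 chunk: the whole message fits one buffer
```

With MPIR_CVAR_CH4_OFI_GPU_PIPELINE_BUFFER_SZ raised to 8 MiB, messages that previously produced several outstanding chunks collapse to a single chunk, which would also mask any chunk-ordering issue.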
@jcosborn In case those observations (changing the buffer size makes it run) were from old tests, could you reconfirm them?
I think it is only using contiguous buffers. Yes, increasing the buffer size avoids the hang.
The original pipeline algorithm will be replaced in https://github.com/pmodels/mpich/pull/7529.
I tested this on sunspot with aurora_test branch commit 4fda512 (just the default module mpich/opt/5.0.0.git.4fda512 there). It looks like the reproducer still hung. When I added export MPIR_CVAR_CH4_OFI_EAGER_THRESHOLD=1000000 on a recommendation, it crashed with:
ERROR: (MPI) Other MPI error (rank 4, host x1921c0s4b0n0, communicator_mpi.cpp:387 in void quda::Communicator::comm_wait(MsgHandle *)())
last kernel called was (name=N4quda12ExtractGhostIdLi3ENS_5gauge11FloatNOrderIdLi18ELi2ELi18EL20QudaStaggeredPhase_s0ELb1EL19QudaGhostExchange_sn2147483648ELb0EEEEE,volume=32x24x24x24,aux=GPU-offline,kernel_arg_threshold=2040,vol=442368,stride=230400,precision=8,geometry=4,Nc=3,extract)
I'm not sure if that message helps anything. But I will test again once https://github.com/pmodels/mpich/pull/7529 is merged too.
@colleeneb MPICH_CVAR_CH4_OFI_HMEM_ENABLE is off, right?
Yes, it is off (i.e. MPICH_CVAR_CH4_OFI_HMEM_ENABLE is unset)
Can we get the long form of the error message, e.g. the MPICH error stack?
Yes, I will try with debug mpich in a bit and report back.
It appears the user may be setting MPI_ERRORS_RETURN as the MPI error handler. We should see if the app can leave the default fatal error handler, or if we can override the user setting somehow in MPICH to get the full error stack.
Some updates from running on sunspot today:
- Sunspot with mpich/opt/5.0.0.git.4fda512: Same as above, it crashes with the
ERROR: (MPI) Other MPI error (rank 19, host x1922c2s2b0n0, communicator_mpi.cpp:387 in void quda::Communicator::comm_wait(MsgHandle *)())
last kernel called was (name=N4quda12ExtractGhostIdLi3ENS_5gauge11FloatNOrderIdLi18ELi2ELi18EL20QudaStaggeredPhase_s0ELb1EL19QudaGhostExchange_sn2147483648ELb0EEEEE,volume=32x24x24x24,aux=GPU-offline,kernel_arg_threshold=2040,vol=442368,stride=230400,precision=8,geometry=4,Nc=3,extract)
- Sunspot with mpich/opt/5.0.0.git.4fda512 and unset MPIR_CVAR_CH4_OFI_EAGER_THRESHOLD: Runs to completion as far as I can tell! James, can you confirm? There's nothing in the error file and the output file ends with:
Device memory used = 2392.2 MB
Pinned device memory used = 0.0 MB
Managed memory used = 0.0 MB
Page-locked host memory used = 937.2 MB
Total host memory used >= 2652.1 MB
- Sunspot with mpich/dbg/5.0.0.git.4fda512: I switched to this to try to get the debug stack trace as discussed, but the build hangs. I'm not sure why yet. I know there are issues with the latest SDK for debug builds of some apps (super slow or crashing), but I'm surprised that just switching to debug mpich would cause this.
So in conclusion right now, it looks like it does not hang if MPIR_CVAR_CH4_OFI_EAGER_THRESHOLD is unset -- I'm not familiar enough with that setting to know why it would have an effect. And compiling with mpich/dbg causes a very slow QUDA build, which we'll need to look into on the compiler side.
I got the error:
Abort(894120335) on node 23 (rank 23 in comm 496): Fatal error in internal_Wait: Other MPI error, error stack:
internal_Wait(72040)..........: MPI_Wait(request=0x8aed160, status=0x1) failed
MPIR_Wait(741)................:
MPIR_Wait_state(698)..........:
MPIDI_progress_test(171)......:
MPIDI_NM_progress(108)........:
MPIDI_OFI_handle_cq_error(550): OFI poll failed (default nic=cxi3: Truncation error)
x4302c3s7b0n0.hsn.cm.aurora.alcf.anl.gov: rank 19 exited with code 143
As usual, once I added debug printfs, the bug hides away :(
Now I couldn't get the error even without the printf :(
Okay, triggers the error 30% of the time. I think I got some clues now.
Thanks a lot!
This is how the pipeline currently works:
Sender:
- while (chunks_remain) issue chunk async copy
- for any chunk whose async copy is done -> issue chunk send
Receiver:
- while (chunks_remain) issue chunk receive
- for any received chunk -> issue async copy
In the sender's second step, the "for any" may complete out of order: for example, both chunk 1 and chunk 2 complete while progress is iterating from chunk 1 to chunk 2, resulting in chunk 2 being sent before chunk 1. This out-of-order send mismatches the corresponding chunk receives.
Solution: if chunk 1's copy is not complete, skip checking the later chunks. This ensures the chunk sends are issued in order.
PR is coming.
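The ordering bug and the fix described above can be sketched with a toy model of the sender's progress loop (hypothetical names; the real logic lives in MPICH's ch4/ofi pipeline code):

```python
# Toy model of the sender side of the GPU pipeline.
# Each chunk's async device-to-host copy may complete out of order,
# but the sends must still go out in chunk order to match the
# receiver's posted chunk receives.

def progress_sends(copy_done, already_sent, in_order=True):
    """Scan chunks once and return the chunks sent during this pass.

    copy_done[i]    -- async copy for chunk i has completed
    already_sent[i] -- chunk i was sent on an earlier pass
    in_order=False  -- buggy behavior: send any chunk whose copy is done
    in_order=True   -- fix: stop scanning at the first incomplete copy
    """
    sent = []
    for i, done in enumerate(copy_done):
        if not done:
            if in_order:
                break    # fix: don't send later chunks before chunk i
            continue     # bug: skip chunk i but keep scanning
        if not already_sent[i]:
            sent.append(i)
            already_sent[i] = True
    return sent

# Chunk 1's and chunk 2's copies finished before chunk 0's did.
copy_done = [False, True, True]

# Buggy: chunks 1 and 2 go out before chunk 0 -> mismatched receives.
print(progress_sends(copy_done, [False] * 3, in_order=False))  # [1, 2]

# Fixed: nothing is sent until chunk 0's copy completes.
print(progress_sends(copy_done, [False] * 3, in_order=True))   # []
```

The fix trades a little pipelining opportunity (later chunks wait behind an incomplete earlier copy) for the ordering guarantee the receiver's matching depends on.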