Crash when calling MPI_Probe too many times?
Hi,
Yesterday I encountered an issue with MPI_Probe when used at large scale on Aurora. To clarify the context, the error was reproduced with 1536 processes, each sending/receiving about 200 messages using P2P communications (Isend, Probe, Get_count, Recv).
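For context, the pattern looks roughly like the sketch below. This is not the actual Shamrock code, just a minimal hypothetical illustration (function name, signature and the src_ranks/dest_ranks parameters are made up): each rank posts its Isends, then probes each expected source to learn the incoming message size before posting the matching Recv.

#include <mpi.h>
#include <vector>

// Hypothetical illustration of the failing pattern, not the real implementation.
void probe_based_exchange(
    const std::vector<std::vector<char>> &to_send, // payloads to send
    const std::vector<int> &dest_ranks,            // destination of each payload
    const std::vector<int> &src_ranks,             // ranks we expect a message from
    int tag,
    MPI_Comm comm) {

    // Post all sends up front.
    std::vector<MPI_Request> send_reqs(to_send.size());
    for (std::size_t i = 0; i < to_send.size(); ++i) {
        MPI_Isend(
            to_send[i].data(), (int) to_send[i].size(), MPI_CHAR,
            dest_ranks[i], tag, comm, &send_reqs[i]);
    }

    // Receive side: the size of each incoming message is unknown,
    // so probe first, query the count, then post the matching Recv.
    for (int src : src_ranks) {
        MPI_Status st;
        MPI_Probe(src, tag, comm, &st); // <- the call that aborts at scale
        int count = 0;
        MPI_Get_count(&st, MPI_CHAR, &count);
        std::vector<char> buf(count);
        MPI_Recv(buf.data(), count, MPI_CHAR, src, tag, comm, MPI_STATUS_IGNORE);
        // ... hand buf over to the application ...
    }

    MPI_Waitall((int) send_reqs.size(), send_reqs.data(), MPI_STATUSES_IGNORE);
}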
Using mpich/dbg/develop-git.6037a7a we get the following error on many processes (we checked that it is not always the same hostname):
Rank 1288 aborted with code 606242447: Fatal error in internal_Probe: Other MPI error, error stack:
internal_Probe(91)............: MPI_Probe(176, 27319, MPI_COMM_WORLD, status=0x7ffed222a9a0) failed
MPID_Probe(107)...............:
MPIDI_iprobe(33)..............:
MPIDI_OFI_do_iprobe(88).......:
MPIDI_OFI_handle_cq_error(789): OFI poll failed (default nic=cxi0: Input/output error)
Also, the error always occurs at the same spot in the code and is reproduced consistently every time.
We tried replacing the Probe/Get_count pattern in the communication with an Allgather on the communication sizes, which fixes the issue. So the error is indeed Probe related. It could also be a libfabric issue ...
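To illustrate the workaround, here is a hedged sketch of the Allgather-based variant. Again the names and signature are hypothetical, it assumes at most one message per rank pair, and it is much simpler than the real sparse_comm_allgather_isend_irecv; the point is that sizes are exchanged collectively first, so every receive is posted with a known count and Probe/Get_count are not needed at all.

#include <mpi.h>
#include <vector>

// Hypothetical illustration of the Allgather-on-sizes workaround.
void allgather_based_exchange(
    const std::vector<std::vector<char>> &to_send,
    const std::vector<int> &dest_ranks,
    int tag,
    MPI_Comm comm) {

    int rank = 0, size = 0;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    // my_counts[d] = number of bytes this rank sends to rank d
    // (assumes at most one message per destination).
    std::vector<int> my_counts(size, 0);
    for (std::size_t i = 0; i < to_send.size(); ++i)
        my_counts[dest_ranks[i]] = (int) to_send[i].size();

    // all_counts[s * size + d] = bytes rank s sends to rank d.
    std::vector<int> all_counts((std::size_t) size * size);
    MPI_Allgather(
        my_counts.data(), size, MPI_INT,
        all_counts.data(), size, MPI_INT, comm);

    // Receive sizes are now known, so everything can be posted as Irecv/Isend.
    std::vector<MPI_Request> reqs;
    std::vector<std::vector<char>> recv_bufs(size);
    for (int s = 0; s < size; ++s) {
        int count = all_counts[(std::size_t) s * size + rank];
        if (count > 0) {
            recv_bufs[s].resize(count);
            reqs.emplace_back();
            MPI_Irecv(recv_bufs[s].data(), count, MPI_CHAR, s, tag, comm, &reqs.back());
        }
    }
    for (std::size_t i = 0; i < to_send.size(); ++i) {
        reqs.emplace_back();
        MPI_Isend(
            to_send[i].data(), (int) to_send[i].size(), MPI_CHAR,
            dest_ranks[i], tag, comm, &reqs.back());
    }
    MPI_Waitall((int) reqs.size(), reqs.data(), MPI_STATUSES_IGNORE);
}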
Reproducer
I'm very sorry that I did not manage to isolate the issue away from our code into a standalone reproducer, but the issue occurs very reliably every time (on 256 Aurora nodes).
Here are the steps to reproduce (I checked this morning that it does indeed reproduce):
git clone --recurse-submodules -b reproducer-aurora/mpi-probe-issue https://github.com/tdavidcl/Shamrock.git Shamrock_reproducer
cd Shamrock_reproducer/
./env/new-env --machine argonne.aurora --builddir build --
cd build
source activate
shamconfigure
shammake
qsub test256nodes.sh
Here is the submission script test256nodes.sh:
#!/bin/bash -l
#PBS -A <project name>
#PBS -N reproducer
#PBS -l walltime=0:30:00
#PBS -l select=256
#PBS -l place=scatter
#PBS -l filesystems=home:flare
#PBS -q prod
#PBS -k doe
export TZ='/usr/share/zoneinfo/US/Central'
cd ${PBS_O_WORKDIR}
NNODES=`wc -l < $PBS_NODEFILE`
NRANKS=6 # Number of MPI ranks to spawn per node
NTOTRANKS=$(( NNODES * NRANKS ))
echo "NUM_OF_NODES= ${NNODES} TOTAL_NUM_RANKS= ${NTOTRANKS} RANKS_PER_NODE= ${NRANKS}"
WORKDIR=/lus/flare/projects/Shamrock
echo "pwd=$(pwd)"
. ./activate
mpiexec -n ${NTOTRANKS} -ppn ${NRANKS} ./1device_per_process_directgpu.sh ./shamrock --smi --sycl-cfg auto:oneAPI --benchmark-mpi --force-dgpu-off --loglevel 1 --rscript ../exemples/sph_weak_scale_test.py
Comparing against the fix
In src/shamalgs/include/shamalgs/collective/sparseXchg.hpp there is the following block of code:
inline void sparse_comm_c(
std::shared_ptr<sham::DeviceScheduler> dev_sched,
const std::vector<SendPayload> &message_send,
std::vector<RecvPayload> &message_recv,
const SparseCommTable &comm_table) {
sparse_comm_debug_infos(dev_sched, message_send, message_recv, comm_table);
    // Using the second function instead of the first one fixes the issue
sparse_comm_isend_probe_count_irecv(dev_sched, message_send, message_recv, comm_table);
// sparse_comm_allgather_isend_irecv(dev_sched, message_send, message_recv, comm_table);
}
Commenting out sparse_comm_isend_probe_count_irecv and enabling sparse_comm_allgather_isend_irecv instead works.
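In other words, the working variant of the block is simply the two calls swapped:

inline void sparse_comm_c(
    std::shared_ptr<sham::DeviceScheduler> dev_sched,
    const std::vector<SendPayload> &message_send,
    std::vector<RecvPayload> &message_recv,
    const SparseCommTable &comm_table) {
    sparse_comm_debug_infos(dev_sched, message_send, message_recv, comm_table);
    // Probe-based variant disabled, Allgather-based variant enabled instead.
    // sparse_comm_isend_probe_count_irecv(dev_sched, message_send, message_recv, comm_table);
    sparse_comm_allgather_isend_irecv(dev_sched, message_send, message_recv, comm_table);
}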
Need to confirm if this issue exists with hybrid match mode.
I just tried with
#!/bin/bash -l
#PBS -A Shamrock
#PBS -N scale_256_hybrid
#PBS -l walltime=0:15:00
#PBS -l select=256
#PBS -l place=scatter
#PBS -l filesystems=home:flare
#PBS -q prod
#PBS -k doe
export TZ='/usr/share/zoneinfo/US/Central'
cd ${PBS_O_WORKDIR}
NNODES=`wc -l < $PBS_NODEFILE`
NRANKS=6 # Number of MPI ranks to spawn per node
NTOTRANKS=$(( NNODES * NRANKS ))
echo "NUM_OF_NODES= ${NNODES} TOTAL_NUM_RANKS= ${NTOTRANKS} RANKS_PER_NODE= ${NRANKS}"
WORKDIR=/lus/flare/projects/Shamrock
echo "pwd=$(pwd)"
. ./activate
export FI_CXI_RX_MATCH_MODE=hybrid
mpiexec -n ${NTOTRANKS} -ppn ${NRANKS} ./1device_per_process_directgpu.sh ./shamrock --smi --sycl-cfg auto:oneAPI --benchmark-mpi --force-dgpu-off --loglevel 1 --rscript ../exemples/sph_weak_scale_test.py
And I still get:
Abort(337806991) on node 299 (rank 299 in comm 0): Fatal error in internal_Probe: Other MPI error, error stack:
internal_Probe(91)............: MPI_Probe(658, 101838, MPI_COMM_WORLD, status=0x7ffe98363000) failed
MPID_Probe(107)...............:
MPIDI_iprobe(33)..............:
MPIDI_OFI_do_iprobe(88).......:
MPIDI_OFI_handle_cq_error(789): OFI poll failed (default nic=cxi1: Input/output error)
Try with a debug build of libfabric and capture more detailed logs for the failure.
We should also consider increasing the CQ buffer size.
Hi, I haven't been able to continue working on this these last months. I will try to make a standalone reproducer to ease the tracking.