Crash when calling MPI_Probe too many times?
Hi,
Yesterday I encountered an issue with MPI_Probe when used at large scale on Aurora. To clarify the context, the error was reproduced with 1536 processes, each sending/receiving about 200 messages using P2P communications (Isend, Probe, Get_count, Recv).
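For context, the pattern looks roughly like the sketch below. This is not the actual Shamrock code, just a minimal hypothetical illustration (function name, signature and the src_ranks/dest_ranks parameters are made up): each rank posts its Isends, then probes each expected source to learn the incoming message size before posting the matching Recv.

#include <mpi.h>
#include <vector>

// Hypothetical illustration of the failing pattern, not the real implementation.
void probe_based_exchange(
    const std::vector<std::vector<char>> &to_send, // payloads to send
    const std::vector<int> &dest_ranks,            // destination of each payload
    const std::vector<int> &src_ranks,             // ranks we expect a message from
    int tag,
    MPI_Comm comm) {

    // Post all sends up front.
    std::vector<MPI_Request> send_reqs(to_send.size());
    for (std::size_t i = 0; i < to_send.size(); ++i) {
        MPI_Isend(
            to_send[i].data(), (int) to_send[i].size(), MPI_CHAR,
            dest_ranks[i], tag, comm, &send_reqs[i]);
    }

    // Receive side: the size of each incoming message is unknown,
    // so probe first, query the count, then post the matching Recv.
    for (int src : src_ranks) {
        MPI_Status st;
        MPI_Probe(src, tag, comm, &st); // <- the call that aborts at scale
        int count = 0;
        MPI_Get_count(&st, MPI_CHAR, &count);
        std::vector<char> buf(count);
        MPI_Recv(buf.data(), count, MPI_CHAR, src, tag, comm, MPI_STATUS_IGNORE);
        // ... hand buf over to the application ...
    }

    MPI_Waitall((int) send_reqs.size(), send_reqs.data(), MPI_STATUSES_IGNORE);
}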
Using mpich/dbg/develop-git.6037a7a we get the following error on many processes (we checked that it is not always the same hostname):
Rank 1288 aborted with code 606242447: Fatal error in internal_Probe: Other MPI error, error stack:
internal_Probe(91)............: MPI_Probe(176, 27319, MPI_COMM_WORLD, status=0x7ffed222a9a0) failed
MPID_Probe(107)...............:
MPIDI_iprobe(33)..............:
MPIDI_OFI_do_iprobe(88).......:
MPIDI_OFI_handle_cq_error(789): OFI poll failed (default nic=cxi0: Input/output error)
Also, the error always occurs at the same spot in the code and is reproduced consistently every time.
We tried replacing the Probe/Get_count pattern in the communication with an Allgather on the communication sizes, which fixes the issue. So the error is indeed Probe related. It could also be a libfabric issue ...
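To illustrate the workaround, here is a hedged sketch of the Allgather-based variant. Again the names and signature are hypothetical, it assumes at most one message per rank pair, and it is much simpler than the real sparse_comm_allgather_isend_irecv; the point is that sizes are exchanged collectively first, so every receive is posted with a known count and Probe/Get_count are not needed at all.

#include <mpi.h>
#include <vector>

// Hypothetical illustration of the Allgather-on-sizes workaround.
void allgather_based_exchange(
    const std::vector<std::vector<char>> &to_send,
    const std::vector<int> &dest_ranks,
    int tag,
    MPI_Comm comm) {

    int rank = 0, size = 0;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    // my_counts[d] = number of bytes this rank sends to rank d
    // (assumes at most one message per destination).
    std::vector<int> my_counts(size, 0);
    for (std::size_t i = 0; i < to_send.size(); ++i)
        my_counts[dest_ranks[i]] = (int) to_send[i].size();

    // all_counts[s * size + d] = bytes rank s sends to rank d.
    std::vector<int> all_counts((std::size_t) size * size);
    MPI_Allgather(
        my_counts.data(), size, MPI_INT,
        all_counts.data(), size, MPI_INT, comm);

    // Receive sizes are now known, so everything can be posted as Irecv/Isend.
    std::vector<MPI_Request> reqs;
    std::vector<std::vector<char>> recv_bufs(size);
    for (int s = 0; s < size; ++s) {
        int count = all_counts[(std::size_t) s * size + rank];
        if (count > 0) {
            recv_bufs[s].resize(count);
            reqs.emplace_back();
            MPI_Irecv(recv_bufs[s].data(), count, MPI_CHAR, s, tag, comm, &reqs.back());
        }
    }
    for (std::size_t i = 0; i < to_send.size(); ++i) {
        reqs.emplace_back();
        MPI_Isend(
            to_send[i].data(), (int) to_send[i].size(), MPI_CHAR,
            dest_ranks[i], tag, comm, &reqs.back());
    }
    MPI_Waitall((int) reqs.size(), reqs.data(), MPI_STATUSES_IGNORE);
}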
Reproducer
I'm very sorry that I did not manage to isolate the issue away from our code into a standalone reproducer, but the issue occurs very reliably every time (on 256 Aurora nodes).
Here are the steps to reproduce (I checked this morning that it does indeed reproduce):
git clone --recurse-submodules -b reproducer-aurora/mpi-probe-issue https://github.com/tdavidcl/Shamrock.git Shamrock_reproducer
cd Shamrock_reproducer/
./env/new-env --machine argonne.aurora --builddir build --
cd build
source activate
shamconfigure
shammake
qsub test256nodes.sh
Here is the submission script test256nodes.sh:
#!/bin/bash -l
#PBS -A <project name>
#PBS -N reproducer
#PBS -l walltime=0:30:00
#PBS -l select=256
#PBS -l place=scatter
#PBS -l filesystems=home:flare
#PBS -q prod
#PBS -k doe
export TZ='/usr/share/zoneinfo/US/Central'
cd ${PBS_O_WORKDIR}
NNODES=`wc -l < $PBS_NODEFILE`
NRANKS=6 # Number of MPI ranks to spawn per node
NTOTRANKS=$(( NNODES * NRANKS ))
echo "NUM_OF_NODES= ${NNODES} TOTAL_NUM_RANKS= ${NTOTRANKS} RANKS_PER_NODE= ${NRANKS}"
WORKDIR=/lus/flare/projects/Shamrock
echo "pwd=$(pwd)"
. ./activate
mpiexec -n ${NTOTRANKS} -ppn ${NRANKS} ./1device_per_process_directgpu.sh ./shamrock --smi --sycl-cfg auto:oneAPI --benchmark-mpi --force-dgpu-off --loglevel 1 --rscript ../exemples/sph_weak_scale_test.py
Comparing against the fix
In src/shamalgs/include/shamalgs/collective/sparseXchg.hpp there is the following block of code:
inline void sparse_comm_c(
std::shared_ptr<sham::DeviceScheduler> dev_sched,
const std::vector<SendPayload> &message_send,
std::vector<RecvPayload> &message_recv,
const SparseCommTable &comm_table) {
sparse_comm_debug_infos(dev_sched, message_send, message_recv, comm_table);
    // Using the second function instead of the first one fixes the issue
sparse_comm_isend_probe_count_irecv(dev_sched, message_send, message_recv, comm_table);
// sparse_comm_allgather_isend_irecv(dev_sched, message_send, message_recv, comm_table);
}
Commenting out sparse_comm_isend_probe_count_irecv and enabling sparse_comm_allgather_isend_irecv instead works.
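In other words, the working variant of the block is simply the two calls swapped:

inline void sparse_comm_c(
    std::shared_ptr<sham::DeviceScheduler> dev_sched,
    const std::vector<SendPayload> &message_send,
    std::vector<RecvPayload> &message_recv,
    const SparseCommTable &comm_table) {
    sparse_comm_debug_infos(dev_sched, message_send, message_recv, comm_table);
    // Probe-based variant disabled, Allgather-based variant enabled instead.
    // sparse_comm_isend_probe_count_irecv(dev_sched, message_send, message_recv, comm_table);
    sparse_comm_allgather_isend_irecv(dev_sched, message_send, message_recv, comm_table);
}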
Need to confirm if this issue exists with hybrid match mode.
I just tried with
#!/bin/bash -l
#PBS -A Shamrock
#PBS -N scale_256_hybrid
#PBS -l walltime=0:15:00
#PBS -l select=256
#PBS -l place=scatter
#PBS -l filesystems=home:flare
#PBS -q prod
#PBS -k doe
export TZ='/usr/share/zoneinfo/US/Central'
cd ${PBS_O_WORKDIR}
NNODES=`wc -l < $PBS_NODEFILE`
NRANKS=6 # Number of MPI ranks to spawn per node
NTOTRANKS=$(( NNODES * NRANKS ))
echo "NUM_OF_NODES= ${NNODES} TOTAL_NUM_RANKS= ${NTOTRANKS} RANKS_PER_NODE= ${NRANKS}"
WORKDIR=/lus/flare/projects/Shamrock
echo "pwd=$(pwd)"
. ./activate
export FI_CXI_RX_MATCH_MODE=hybrid
mpiexec -n ${NTOTRANKS} -ppn ${NRANKS} ./1device_per_process_directgpu.sh ./shamrock --smi --sycl-cfg auto:oneAPI --benchmark-mpi --force-dgpu-off --loglevel 1 --rscript ../exemples/sph_weak_scale_test.py
And I still get:
Abort(337806991) on node 299 (rank 299 in comm 0): Fatal error in internal_Probe: Other MPI error, error stack:
internal_Probe(91)............: MPI_Probe(658, 101838, MPI_COMM_WORLD, status=0x7ffe98363000) failed
MPID_Probe(107)...............:
MPIDI_iprobe(33)..............:
MPIDI_OFI_do_iprobe(88).......:
MPIDI_OFI_handle_cq_error(789): OFI poll failed (default nic=cxi1: Input/output error)
Try with a debug build of libfabric and capture more detailed logs for the failure.
We should also consider increasing the CQ buffer size.
Hi, I haven't been able to continue working on this these last months. I will try to make a standalone reproducer to ease the tracking.