osu_ialltoall Other MPI error starting at 32 nodes 96 ppn
This happened on the current image at 512 nodes.
This happened on the next-eval image at 32, 128, and 512 nodes. The 32-node output is at the following path:
/lus/flare/projects/Aurora_testing/mpi/osu_rfm/run_collective/32/ialltoall-ialltoallv-ialltoallw/stage/2025-09-14_11-09-45/aurora/compute/PrgEnv-intel/RunMPIcollective
cat rfm_job.err | grep "error"
Abort(15) on node 1632 (rank 1632 in comm 0): Fatal error in internal_Wait: Other MPI error
Rank 1632 aborted with code 15: Fatal error in internal_Wait: Other MPI error
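To see whether provider-level messages accompany the aborts, a broader scan of the error log may help. This is only a sketch with assumed search patterns, not strings confirmed from this run:

# Tally libfabric/cxi-related lines in the error log (patterns are assumptions; adjust to what rfm_job.err actually contains)
grep -iE "libfabric|cxi|ofi" rfm_job.err | sort | uniq -c | sort -rn | head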
The error is encountered during the mpiexec invocation of osu_ialltoall.
Are you running at 96 PPN?
I suspect the error is a libfabric I/O error, similar to the other issues when the network is overwhelmed and libcxi bails out.
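One way to test that suspicion (a sketch, assuming the launcher forwards exported environment variables to the ranks): rerun the failing case with libfabric's warning-level logging enabled for the cxi provider and look for I/O or resource errors just before the abort.

# FI_LOG_LEVEL / FI_LOG_PROV are standard libfabric logging variables;
# the mpiexec line mirrors the failing command below, with the binary path and cpu-bind list elided.
export FI_LOG_LEVEL=warn
export FI_LOG_PROV=cxi
mpiexec --np 3072 --ppn 96 .../binaries/osu_ialltoall -m 1024:1024 -i 1000 -x 100 -f -z 2> libfabric_warnings.err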
cd /lus/flare/projects/Aurora_testing/mpi/osu_rfm/run_collective/32/ialltoall-ialltoallv-ialltoallw/stage/2025-09-14_11-09-45/aurora/compute/PrgEnv-intel/RunMPIcollective
awk 'BEGIN{N=7} {if(prev~/Lat\(us\)/&&/Sun/){for(i=NR-N;i<NR;i++)if(i>0)print buffer[i%N];print $0;count=1} else if(count>0){print "\n\n";count--} buffer[NR%N]=$0; prev=$0}' rfm_job.out
mpiexec --np 3072 --ppn 96 --cpu-bind verbose,list:1:2:3:4:5:6:7:8:9:10:11:12:13:14:15:16:18:19:20:21:22:23:24:25:26:27:28:29:30:31:32:33:35:36:37:38:39:40:41:42:43:44:45:46:47:48:49:50:53:54:55:56:57:58:59:60:61:62:63:64:65:66:67:68:70:71:72:73:74:75:76:77:78:79:80:81:82:83:84:85:87:88:89:90:91:92:93:94:95:96:97:98:99:100:101:102 /lus/flare/projects/Aurora_testing/mpi/osu_rfm/run_collective/32/ialltoall-ialltoallv-ialltoallw/stage/2025-09-14_11-09-45/aurora/compute/PrgEnv-intel/BuildMPIcollective_93bceebc/binaries/osu_ialltoall -m 1024:1024 -i 1000 -x 100 -f -z
# OSU MPI Non-blocking All-to-All Latency Test v7.5
# Overall = Coll. Init + Compute + MPI_Test + MPI_Wait
# Datatype: MPI_CHAR.
# Size Overall(us) Compute(us) Coll. Init(us) MPI_Test(us) MPI_Wait(us) Pure Comm.(us) Min Comm.(us) Max Comm.(us) Overlap(%) P50 Tail Lat(us) P90 Tail Lat(us) P99 Tail Lat(us)
Sun 14 Sep 2025 04:15:16 PM UTC
cd /lus/flare/projects/Aurora_testing/mpi/osu_rfm/run_collective/32/ialltoall-ialltoallv-ialltoallw/stage/2025-09-13_17-49-31/aurora/compute/PrgEnv-intel/RunMPIcollective
awk 'BEGIN{N=7} {if(prev~/Lat\(us\)/&&/Sun/){for(i=NR-N;i<NR;i++)if(i>0)print buffer[i%N];print $0;count=1} else if(count>0){print "\n\n";count--} buffer[NR%N]=$0; prev=$0}' rfm_job.out
mpiexec --np 3072 --ppn 96 --cpu-bind verbose,list:1:2:3:4:5:6:7:8:9:10:11:12:13:14:15:16:18:19:20:21:22:23:24:25:26:27:28:29:30:31:32:33:35:36:37:38:39:40:41:42:43:44:45:46:47:48:49:50:53:54:55:56:57:58:59:60:61:62:63:64:65:66:67:68:70:71:72:73:74:75:76:77:78:79:80:81:82:83:84:85:87:88:89:90:91:92:93:94:95:96:97:98:99:100:101:102 /lus/flare/projects/Aurora_testing/mpi/osu_rfm/run_collective/32/ialltoall-ialltoallv-ialltoallw/stage/2025-09-13_17-49-31/aurora/compute/PrgEnv-intel/BuildMPIcollective_93bceebc/binaries/osu_ialltoall -m 1024:1024 -i 1000 -x 100 -f -z
# OSU MPI Non-blocking All-to-All Latency Test v7.5
# Overall = Coll. Init + Compute + MPI_Test + MPI_Wait
# Datatype: MPI_CHAR.
# Size Overall(us) Compute(us) Coll. Init(us) MPI_Test(us) MPI_Wait(us) Pure Comm.(us) Min Comm.(us) Max Comm.(us) Overlap(%) P50 Tail Lat(us) P90 Tail Lat(us) P99 Tail Lat(us)
Sun 14 Sep 2025 12:15:30 AM UTC
I suspect the error is a libfabric I/O error,
Does this help explain why the error shows up at a lower node count on the next-eval image (32 nodes) than on the current image (512 nodes)?