osu_gatherv_persistent leads to "Fatal error in internal_Barrier: Other MPI error"
Command:

```shell
mpiexec --np 49152 --ppn 96 \
  --cpu-bind verbose,list:1:2:3:4:5:6:7:8:9:10:11:12:13:14:15:16:18:19:20:21:22:23:24:25:26:27:28:29:30:31:32:33:35:36:37:38:39:40:41:42:43:44:45:46:47:48:49:50:53:54:55:56:57:58:59:60:61:62:63:64:65:66:67:68:70:71:72:73:74:75:76:77:78:79:80:81:82:83:84:85:87:88:89:90:91:92:93:94:95:96:97:98:99:100:101:102 \
  --gpu-bind verbose,list:0.0:0.0:0.0:0.0:0.0:0.0:0.0:0.0:0.1:0.1:0.1:0.1:0.1:0.1:0.1:0.1:1.0:1.0:1.0:1.0:1.0:1.0:1.0:1.0:1.1:1.1:1.1:1.1:1.1:1.1:1.1:1.1:2.0:2.0:2.0:2.0:2.0:2.0:2.0:2.0:2.1:2.1:2.1:2.1:2.1:2.1:2.1:2.1:3.0:3.0:3.0:3.0:3.0:3.0:3.0:3.0:3.1:3.1:3.1:3.1:3.1:3.1:3.1:3.1:4.0:4.0:4.0:4.0:4.0:4.0:4.0:4.0:4.1:4.1:4.1:4.1:4.1:4.1:4.1:4.1:5.0:5.0:5.0:5.0:5.0:5.0:5.0:5.0:5.1:5.1:5.1:5.1:5.1:5.1:5.1:5.1 \
  /lus/flare/projects/Aurora_testing/mpi/osu_rfm/run_collective/512/gather-gather_persistent-gatherv-gatherv_persistent/stage/2025-09-12_18-37-26/aurora/compute/PrgEnv-intel/BuildMPIcollective_93bceebc/binaries/osu_gatherv_persistent -m 8:8 -i 1000 -x 100 -f -z -d sycl
```
Output:

```
# OSU MPI-SYCL Gatherv Persistent Latency Test v7.5
# Datatype: MPI_CHAR.
# Size Avg Latency(us) Min Latency(us) Max Latency(us) Iterations P50 Tail Lat(us) P90 Tail Lat(us) P99 Tail Lat(us)
Abort(15) on node 11436 (rank 11436 in comm 0): Fatal error in internal_Barrier: Other MPI error
Abort(15) on node 11452 (rank 11452 in comm 0): Fatal error in internal_Barrier: Other MPI error
Abort(15) on node 11456 (rank 11456 in comm 0): Fatal error in internal_Barrier: Other MPI error
Abort(15) on node 11464 (rank 11464 in comm 0): Fatal error in internal_Barrier: Other MPI error
Rank 11436 aborted with code 15: Fatal error in internal_Barrier: Other MPI error
x4719c3s1b0n0.hsn.cm.aurora.alcf.anl.gov: rank 11436 died from signal 15
Abort(15) on node 37063 (rank 37063 in comm 0): Fatal error in internal_Barrier: Other MPI error
Abort(15) on node 37079 (rank 37079 in comm 0): Fatal error in internal_Barrier: Other MPI error
Abort(15) on node 37087 (rank 37087 in comm 0): Fatal error in internal_Barrier: Other MPI error
Abort(15) on node 37103 (rank 37103 in comm 0): Fatal error in internal_Barrier: Other MPI error
Abort(15) on node 37059 (rank 37059 in comm 0): Fatal error in internal_Barrier: Other MPI error
Abort(15) on node 37067 (rank 37067 in comm 0): Fatal error in internal_Barrier: Other MPI error
Abort(15) on node 37071 (rank 37071 in comm 0): Fatal error in internal_Barrier: Other MPI error
Abort(15) on node 37075 (rank 37075 in comm 0): Fatal error in internal_Barrier: Other MPI error
Abort(15) on node 37083 (rank 37083 in comm 0): Fatal error in internal_Barrier: Other MPI error
Abort(15) on node 37095 (rank 37095 in comm 0): Fatal error in internal_Barrier: Other MPI error
Rank 37095 aborted with code 15: Fatal error in internal_Barrier: Other MPI error
x4203c1s6b0n0.hsn.cm.aurora.alcf.anl.gov: rank 787 died from signal 6
```
Command (with tuning file and throttle environment variable):

```shell
module load mpich-config/collective-tuning/1024
export MPIR_CVAR_CH4_PROGRESS_THROTTLE=1   # exported so the mpiexec-launched ranks inherit it
mpiexec --np 49152 --ppn 96 \
  --cpu-bind verbose,list:1:2:3:4:5:6:7:8:9:10:11:12:13:14:15:16:18:19:20:21:22:23:24:25:26:27:28:29:30:31:32:33:35:36:37:38:39:40:41:42:43:44:45:46:47:48:49:50:53:54:55:56:57:58:59:60:61:62:63:64:65:66:67:68:70:71:72:73:74:75:76:77:78:79:80:81:82:83:84:85:87:88:89:90:91:92:93:94:95:96:97:98:99:100:101:102 \
  --gpu-bind verbose,list:0.0:0.0:0.0:0.0:0.0:0.0:0.0:0.0:0.1:0.1:0.1:0.1:0.1:0.1:0.1:0.1:1.0:1.0:1.0:1.0:1.0:1.0:1.0:1.0:1.1:1.1:1.1:1.1:1.1:1.1:1.1:1.1:2.0:2.0:2.0:2.0:2.0:2.0:2.0:2.0:2.1:2.1:2.1:2.1:2.1:2.1:2.1:2.1:3.0:3.0:3.0:3.0:3.0:3.0:3.0:3.0:3.1:3.1:3.1:3.1:3.1:3.1:3.1:3.1:4.0:4.0:4.0:4.0:4.0:4.0:4.0:4.0:4.1:4.1:4.1:4.1:4.1:4.1:4.1:4.1:5.0:5.0:5.0:5.0:5.0:5.0:5.0:5.0:5.1:5.1:5.1:5.1:5.1:5.1:5.1:5.1 \
  /lus/flare/projects/Aurora_testing/mpi/osu_rfm/run_collective/512/gather-gather_persistent-gatherv-gatherv_persistent/stage/2025-09-12_18-37-26/aurora/compute/PrgEnv-intel/BuildMPIcollective_93bceebc/binaries/osu_gatherv_persistent -m 8:8 -i 1000 -x 100 -f -z -d sycl
unset MPIR_CVAR_CH4_PROGRESS_THROTTLE
module unload mpich-config/collective-tuning/1024
```
Output:

```
# OSU MPI-SYCL Gatherv Persistent Latency Test v7.5
# Datatype: MPI_CHAR.
# Size Avg Latency(us) Min Latency(us) Max Latency(us) Iterations P50 Tail Lat(us) P90 Tail Lat(us) P99 Tail Lat(us)
Abort(15) on node 29157 (rank 29157 in comm 0): Fatal error in internal_Barrier: Other MPI error
Abort(15) on node 29161 (rank 29161 in comm 0): Fatal error in internal_Barrier: Other MPI error
Abort(15) on node 29181 (rank 29181 in comm 0): Fatal error in internal_Barrier: Other MPI error
Rank 29181 aborted with code 15: Fatal error in internal_Barrier: Other MPI error
x4201c0s0b0n0.hsn.cm.aurora.alcf.anl.gov: rank 29181 died from signal 15
x4419c0s5b0n0.hsn.cm.aurora.alcf.anl.gov: rank 21760 died from signal 6
```
Related: CPU alltoall_persistent fails at 32 nodes
```shell
cd /lus/flare/projects/Aurora_testing/mpi/osu_rfm/run_collective/32/alltoall-alltoall_persistent-alltoallv-alltoallv_persistent-alltoallw-alltoallw_persistent/stage/2025-09-14_11-29-57/aurora/compute/PrgEnv-intel/RunMPIcollective
cat rfm_job.out | grep "ref_collectives-node_count>32-target>cpu-binary>osu_alltoall_persistent-ppn>96-message_length>1024-tune_level>0" -A 10
cat rfm_job.err | grep -E -v "Warning|OMB|multiple|pid|reset"
```

Output:

```
Abort(15) on node 1581 (rank 1581 in comm 0): Fatal error in internal_Wait: Other MPI error
Rank 1581 aborted with code 15: Fatal error in internal_Wait: Other MPI error
x4114c2s2b0n0.hsn.cm.aurora.alcf.anl.gov: rank 1581 died from signal 15
Abort(15) on node 2043 (rank 2043 in comm 0): Fatal error in internal_Wait: Other MPI error
Rank 2043 aborted with code 15: Fatal error in internal_Wait: Other MPI error
x4114c2s7b0n0.hsn.cm.aurora.alcf.anl.gov: rank 2043 died from signal 15
```
Ask Aurora users whether anyone is using persistent collectives. If not, deprioritize persistent-collective testing and focus on the most heavily used operations.
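For context when asking, the failing benchmarks exercise the MPI-4 persistent-collective pattern (init once, then repeated `MPI_Start`/`MPI_Wait`). A minimal standalone sketch of that pattern for gatherv, which could also serve as a reduced reproducer, is below. This is illustrative and not the OSU benchmark source; the 8-byte message size mirrors the `-m 8:8` runs above, and the iteration count is arbitrary.

```c
/* Minimal persistent-gatherv sketch (MPI-4 MPI_Gatherv_init).
 * Build: mpicc repro_gatherv_persistent.c -o repro
 * Run:   mpiexec -np <N> ./repro
 */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int msg = 8;                       /* matches -m 8:8 above */
    char *sendbuf = malloc(msg);
    char *recvbuf = NULL;
    int *counts = NULL, *displs = NULL;
    if (rank == 0) {                         /* only the root needs recv metadata */
        recvbuf = malloc((size_t)msg * size);
        counts  = malloc(size * sizeof(int));
        displs  = malloc(size * sizeof(int));
        for (int i = 0; i < size; i++) { counts[i] = msg; displs[i] = i * msg; }
    }

    /* Create the persistent request once... */
    MPI_Request req;
    MPI_Gatherv_init(sendbuf, msg, MPI_CHAR, recvbuf, counts, displs,
                     MPI_CHAR, 0, MPI_COMM_WORLD, MPI_INFO_NULL, &req);

    /* ...then start/complete it repeatedly, as the benchmark does.
     * The reported failures surface in internal_Wait / internal_Barrier,
     * i.e. in completion of the started operation or the inter-iteration
     * synchronization. */
    for (int iter = 0; iter < 1000; iter++) {
        MPI_Start(&req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        MPI_Barrier(MPI_COMM_WORLD);
    }

    MPI_Request_free(&req);
    MPI_Finalize();
    return 0;
}
```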
More examples of alltoall failures in persistent mode:
32 nodes (np 3072), without tuning:

```shell
mpiexec --np 3072 --ppn 96 \
  --cpu-bind verbose,list:1:2:3:4:5:6:7:8:9:10:11:12:13:14:15:16:18:19:20:21:22:23:24:25:26:27:28:29:30:31:32:33:35:36:37:38:39:40:41:42:43:44:45:46:47:48:49:50:53:54:55:56:57:58:59:60:61:62:63:64:65:66:67:68:70:71:72:73:74:75:76:77:78:79:80:81:82:83:84:85:87:88:89:90:91:92:93:94:95:96:97:98:99:100:101:102 \
  /lus/flare/projects/Aurora_testing/mpi/osu_rfm/run_collective/32/alltoall-alltoall_persistent-alltoallv-alltoallv_persistent-alltoallw-alltoallw_persistent/stage/2025-09-14_11-29-57/aurora/compute/PrgEnv-intel/BuildMPIcollective_93bceebc/binaries/osu_alltoall_persistent -m 1024:1024 -i 1000 -x 100 -f -z
```

With the tuning file and throttle variable:

```shell
module load mpich-config/collective-tuning/1024
export MPIR_CVAR_CH4_PROGRESS_THROTTLE=1   # exported so the mpiexec-launched ranks inherit it
mpiexec --np 3072 --ppn 96 \
  --cpu-bind verbose,list:1:2:3:4:5:6:7:8:9:10:11:12:13:14:15:16:18:19:20:21:22:23:24:25:26:27:28:29:30:31:32:33:35:36:37:38:39:40:41:42:43:44:45:46:47:48:49:50:53:54:55:56:57:58:59:60:61:62:63:64:65:66:67:68:70:71:72:73:74:75:76:77:78:79:80:81:82:83:84:85:87:88:89:90:91:92:93:94:95:96:97:98:99:100:101:102 \
  /lus/flare/projects/Aurora_testing/mpi/osu_rfm/run_collective/32/alltoall-alltoall_persistent-alltoallv-alltoallv_persistent-alltoallw-alltoallw_persistent/stage/2025-09-14_11-29-57/aurora/compute/PrgEnv-intel/BuildMPIcollective_93bceebc/binaries/osu_alltoall_persistent -m 1024:1024 -i 1000 -x 100 -f -z
unset MPIR_CVAR_CH4_PROGRESS_THROTTLE
module unload mpich-config/collective-tuning/1024
```
128 nodes (np 12288), without tuning:

```shell
mpiexec --np 12288 --ppn 96 \
  --cpu-bind verbose,list:1:2:3:4:5:6:7:8:9:10:11:12:13:14:15:16:18:19:20:21:22:23:24:25:26:27:28:29:30:31:32:33:35:36:37:38:39:40:41:42:43:44:45:46:47:48:49:50:53:54:55:56:57:58:59:60:61:62:63:64:65:66:67:68:70:71:72:73:74:75:76:77:78:79:80:81:82:83:84:85:87:88:89:90:91:92:93:94:95:96:97:98:99:100:101:102 \
  /lus/flare/projects/Aurora_testing/mpi/osu_rfm/run_collective/128/alltoall-alltoall_persistent-alltoallv-alltoallv_persistent-alltoallw-alltoallw_persistent/stage/2025-09-14_11-29-57/aurora/compute/PrgEnv-intel/BuildMPIcollective_93bceebc/binaries/osu_alltoall_persistent -m 1024:1024 -i 1000 -x 100 -f -z
```

With the tuning file and throttle variable:

```shell
module load mpich-config/collective-tuning/1024
export MPIR_CVAR_CH4_PROGRESS_THROTTLE=1   # exported so the mpiexec-launched ranks inherit it
mpiexec --np 12288 --ppn 96 \
  --cpu-bind verbose,list:1:2:3:4:5:6:7:8:9:10:11:12:13:14:15:16:18:19:20:21:22:23:24:25:26:27:28:29:30:31:32:33:35:36:37:38:39:40:41:42:43:44:45:46:47:48:49:50:53:54:55:56:57:58:59:60:61:62:63:64:65:66:67:68:70:71:72:73:74:75:76:77:78:79:80:81:82:83:84:85:87:88:89:90:91:92:93:94:95:96:97:98:99:100:101:102 \
  /lus/flare/projects/Aurora_testing/mpi/osu_rfm/run_collective/128/alltoall-alltoall_persistent-alltoallv-alltoallv_persistent-alltoallw-alltoallw_persistent/stage/2025-09-14_11-29-57/aurora/compute/PrgEnv-intel/BuildMPIcollective_93bceebc/binaries/osu_alltoall_persistent -m 1024:1024 -i 1000 -x 100 -f -z
unset MPIR_CVAR_CH4_PROGRESS_THROTTLE
module unload mpich-config/collective-tuning/1024
```