
osu_gatherv_persistent leads to "Fatal error in internal_Barrier: Other MPI error"

Open · longfei-austin opened this issue 4 months ago · 4 comments

command: mpiexec --np 49152 --ppn 96 --cpu-bind verbose,list:1:2:3:4:5:6:7:8:9:10:11:12:13:14:15:16:18:19:20:21:22:23:24:25:26:27:28:29:30:31:32:33:35:36:37:38:39:40:41:42:43:44:45:46:47:48:49:50:53:54:55:56:57:58:59:60:61:62:63:64:65:66:67:68:70:71:72:73:74:75:76:77:78:79:80:81:82:83:84:85:87:88:89:90:91:92:93:94:95:96:97:98:99:100:101:102 --gpu-bind verbose,list:0.0:0.0:0.0:0.0:0.0:0.0:0.0:0.0:0.1:0.1:0.1:0.1:0.1:0.1:0.1:0.1:1.0:1.0:1.0:1.0:1.0:1.0:1.0:1.0:1.1:1.1:1.1:1.1:1.1:1.1:1.1:1.1:2.0:2.0:2.0:2.0:2.0:2.0:2.0:2.0:2.1:2.1:2.1:2.1:2.1:2.1:2.1:2.1:3.0:3.0:3.0:3.0:3.0:3.0:3.0:3.0:3.1:3.1:3.1:3.1:3.1:3.1:3.1:3.1:4.0:4.0:4.0:4.0:4.0:4.0:4.0:4.0:4.1:4.1:4.1:4.1:4.1:4.1:4.1:4.1:5.0:5.0:5.0:5.0:5.0:5.0:5.0:5.0:5.1:5.1:5.1:5.1:5.1:5.1:5.1:5.1 /lus/flare/projects/Aurora_testing/mpi/osu_rfm/run_collective/512/gather-gather_persistent-gatherv-gatherv_persistent/stage/2025-09-12_18-37-26/aurora/compute/PrgEnv-intel/BuildMPIcollective_93bceebc/binaries/osu_gatherv_persistent -m 8:8 -i 1000 -x 100 -f -z -d sycl

output:

# OSU MPI-SYCL Gatherv Persistent Latency Test v7.5
# Datatype: MPI_CHAR.
# Size  Avg Latency(us)  Min Latency(us)  Max Latency(us)  Iterations  P50 Tail Lat(us)  P90 Tail Lat(us)  P99 Tail Lat(us)
Abort(15) on node 11436 (rank 11436 in comm 0): Fatal error in internal_Barrier: Other MPI error
Abort(15) on node 11452 (rank 11452 in comm 0): Fatal error in internal_Barrier: Other MPI error
Abort(15) on node 11456 (rank 11456 in comm 0): Fatal error in internal_Barrier: Other MPI error
Abort(15) on node 11464 (rank 11464 in comm 0): Fatal error in internal_Barrier: Other MPI error
Rank 11436 aborted with code 15: Fatal error in internal_Barrier: Other MPI error
x4719c3s1b0n0.hsn.cm.aurora.alcf.anl.gov: rank 11436 died from signal 15
Abort(15) on node 37063 (rank 37063 in comm 0): Fatal error in internal_Barrier: Other MPI error
Abort(15) on node 37079 (rank 37079 in comm 0): Fatal error in internal_Barrier: Other MPI error
Abort(15) on node 37087 (rank 37087 in comm 0): Fatal error in internal_Barrier: Other MPI error
Abort(15) on node 37103 (rank 37103 in comm 0): Fatal error in internal_Barrier: Other MPI error
Abort(15) on node 37059 (rank 37059 in comm 0): Fatal error in internal_Barrier: Other MPI error
Abort(15) on node 37067 (rank 37067 in comm 0): Fatal error in internal_Barrier: Other MPI error
Abort(15) on node 37071 (rank 37071 in comm 0): Fatal error in internal_Barrier: Other MPI error
Abort(15) on node 37075 (rank 37075 in comm 0): Fatal error in internal_Barrier: Other MPI error
Abort(15) on node 37083 (rank 37083 in comm 0): Fatal error in internal_Barrier: Other MPI error
Abort(15) on node 37095 (rank 37095 in comm 0): Fatal error in internal_Barrier: Other MPI error
Rank 37095 aborted with code 15: Fatal error in internal_Barrier: Other MPI error
x4203c1s6b0n0.hsn.cm.aurora.alcf.anl.gov: rank 787 died from signal 6

longfei-austin · Sep 13 '25 18:09
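For reference, the persistent-gatherv pattern that osu_gatherv_persistent exercises looks roughly like the host-memory sketch below (the real benchmark stages buffers through SYCL device memory via `-d sycl`; the message size and iteration count mirror the `-m 8:8` and `-i 1000` options, and the inter-iteration barrier is an assumption about where the reported internal_Barrier failure would surface):

```c
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int msg = 8;                         /* matches -m 8:8 */
    char *sendbuf = malloc(msg);
    char *recvbuf = NULL;
    int *counts = NULL, *displs = NULL;
    if (rank == 0) {                           /* receive-side args matter only at root */
        recvbuf = malloc((size_t)msg * size);
        counts  = malloc(size * sizeof *counts);
        displs  = malloc(size * sizeof *displs);
        for (int i = 0; i < size; i++) { counts[i] = msg; displs[i] = i * msg; }
    }

    /* Persistent gatherv (MPI 4.0): initialize once, start/complete per iteration. */
    MPI_Request req;
    MPI_Gatherv_init(sendbuf, msg, MPI_CHAR, recvbuf, counts, displs,
                     MPI_CHAR, 0, MPI_COMM_WORLD, MPI_INFO_NULL, &req);

    for (int iter = 0; iter < 1000; iter++) {  /* matches -i 1000 */
        MPI_Start(&req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        MPI_Barrier(MPI_COMM_WORLD);           /* assumed inter-iteration sync point */
    }

    MPI_Request_free(&req);
    MPI_Finalize();
    return 0;
}
```

Since the aborts name internal_Barrier rather than the gatherv itself, the ranks appear to be failing in the synchronization step between iterations rather than in the persistent operation's data movement.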

Command (with tuning file and throttle env variable): module load mpich-config/collective-tuning/1024 ; MPIR_CVAR_CH4_PROGRESS_THROTTLE=1 ; mpiexec --np 49152 --ppn 96 --cpu-bind verbose,list:1:2:3:4:5:6:7:8:9:10:11:12:13:14:15:16:18:19:20:21:22:23:24:25:26:27:28:29:30:31:32:33:35:36:37:38:39:40:41:42:43:44:45:46:47:48:49:50:53:54:55:56:57:58:59:60:61:62:63:64:65:66:67:68:70:71:72:73:74:75:76:77:78:79:80:81:82:83:84:85:87:88:89:90:91:92:93:94:95:96:97:98:99:100:101:102 --gpu-bind verbose,list:0.0:0.0:0.0:0.0:0.0:0.0:0.0:0.0:0.1:0.1:0.1:0.1:0.1:0.1:0.1:0.1:1.0:1.0:1.0:1.0:1.0:1.0:1.0:1.0:1.1:1.1:1.1:1.1:1.1:1.1:1.1:1.1:2.0:2.0:2.0:2.0:2.0:2.0:2.0:2.0:2.1:2.1:2.1:2.1:2.1:2.1:2.1:2.1:3.0:3.0:3.0:3.0:3.0:3.0:3.0:3.0:3.1:3.1:3.1:3.1:3.1:3.1:3.1:3.1:4.0:4.0:4.0:4.0:4.0:4.0:4.0:4.0:4.1:4.1:4.1:4.1:4.1:4.1:4.1:4.1:5.0:5.0:5.0:5.0:5.0:5.0:5.0:5.0:5.1:5.1:5.1:5.1:5.1:5.1:5.1:5.1 /lus/flare/projects/Aurora_testing/mpi/osu_rfm/run_collective/512/gather-gather_persistent-gatherv-gatherv_persistent/stage/2025-09-12_18-37-26/aurora/compute/PrgEnv-intel/BuildMPIcollective_93bceebc/binaries/osu_gatherv_persistent -m 8:8 -i 1000 -x 100 -f -z -d sycl ; unset MPIR_CVAR_CH4_PROGRESS_THROTTLE ; module unload mpich-config/collective-tuning/1024

output:

# OSU MPI-SYCL Gatherv Persistent Latency Test v7.5
# Datatype: MPI_CHAR.
# Size  Avg Latency(us)  Min Latency(us)  Max Latency(us)  Iterations  P50 Tail Lat(us)  P90 Tail Lat(us)  P99 Tail Lat(us)
Abort(15) on node 29157 (rank 29157 in comm 0): Fatal error in internal_Barrier: Other MPI error
Abort(15) on node 29161 (rank 29161 in comm 0): Fatal error in internal_Barrier: Other MPI error
Abort(15) on node 29181 (rank 29181 in comm 0): Fatal error in internal_Barrier: Other MPI error
Rank 29181 aborted with code 15: Fatal error in internal_Barrier: Other MPI error
x4201c0s0b0n0.hsn.cm.aurora.alcf.anl.gov: rank 29181 died from signal 15
x4419c0s5b0n0.hsn.cm.aurora.alcf.anl.gov: rank 21760 died from signal 6

longfei-austin · Sep 13 '25 18:09
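One side note on the reproduction command: with the trailing semicolon, `MPIR_CVAR_CH4_PROGRESS_THROTTLE=1 ;` assigns a plain shell variable without exporting it, so depending on the shell it may never appear in the environment that mpiexec propagates to the ranks; `export MPIR_CVAR_CH4_PROGRESS_THROTTLE=1` or the prefix form `MPIR_CVAR_CH4_PROGRESS_THROTTLE=1 mpiexec ...` (no semicolon) would. For completeness, MPICH CVARs can also be inspected and, when writable, set from inside the application through the standard MPI_T tools interface. The sketch below looks the throttle CVAR up by name; whether this particular CVAR exists, is writable, and takes effect when set this way in a given MPICH build is an assumption, and error handling is omitted for brevity:

```c
#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv) {
    int provided, ncvars = 0;
    /* The MPI_T interface is usable before MPI_Init, which is where most
     * init-time CVARs would need to be set to take effect. */
    MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);
    MPI_T_cvar_get_num(&ncvars);

    for (int i = 0; i < ncvars; i++) {
        char name[256];
        int name_len = sizeof(name);
        /* Only the name is needed here; NULL skips the other output fields. */
        MPI_T_cvar_get_info(i, name, &name_len, NULL, NULL, NULL,
                            NULL, NULL, NULL, NULL);
        if (strcmp(name, "MPIR_CVAR_CH4_PROGRESS_THROTTLE") == 0) {
            MPI_T_cvar_handle handle;
            int count, val = 1;
            MPI_T_cvar_handle_alloc(i, NULL, &handle, &count);
            MPI_T_cvar_write(handle, &val);   /* fails if the CVAR is read-only */
            MPI_T_cvar_handle_free(&handle);
            printf("set %s=1 via MPI_T\n", name);
            break;
        }
    }

    MPI_Init(&argc, &argv);
    /* ... run the benchmark/application ... */
    MPI_Finalize();
    MPI_T_finalize();
    return 0;
}
```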

Related: CPU alltoall_persistent fails at 32 nodes

cd /lus/flare/projects/Aurora_testing/mpi/osu_rfm/run_collective/32/alltoall-alltoall_persistent-alltoallv-alltoallv_persistent-alltoallw-alltoallw_persistent/stage/2025-09-14_11-29-57/aurora/compute/PrgEnv-intel/RunMPIcollective

cat rfm_job.out | grep "ref_collectives-node_count>32-target>cpu-binary>osu_alltoall_persistent-ppn>96-message_length>1024-tune_level>0" -A 10

cat rfm_job.err | grep -E -v "Warning|OMB|multiple|pid|reset"

Abort(15) on node 1581 (rank 1581 in comm 0): Fatal error in internal_Wait: Other MPI error
Rank 1581 aborted with code 15: Fatal error in internal_Wait: Other MPI error
x4114c2s2b0n0.hsn.cm.aurora.alcf.anl.gov: rank 1581 died from signal 15
Abort(15) on node 2043 (rank 2043 in comm 0): Fatal error in internal_Wait: Other MPI error
Rank 2043 aborted with code 15: Fatal error in internal_Wait: Other MPI error
x4114c2s7b0n0.hsn.cm.aurora.alcf.anl.gov: rank 2043 died from signal 15

longfei-austin · Sep 24 '25 17:09
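The same start/wait pattern applies here, with a persistent alltoall in place of the gatherv; a minimal host-memory sketch under the `-m 1024:1024` and `-i 1000` options (everything beyond those options is assumed). Note that the errors here name internal_Wait, consistent with ranks failing in the completion call on the persistent request:

```c
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int size;
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int msg = 1024;                         /* matches -m 1024:1024 */
    char *sendbuf = malloc((size_t)msg * size);   /* msg bytes per peer */
    char *recvbuf = malloc((size_t)msg * size);

    /* Persistent alltoall (MPI 4.0): initialize once, start/complete per iteration. */
    MPI_Request req;
    MPI_Alltoall_init(sendbuf, msg, MPI_CHAR, recvbuf, msg, MPI_CHAR,
                      MPI_COMM_WORLD, MPI_INFO_NULL, &req);

    for (int iter = 0; iter < 1000; iter++) {     /* matches -i 1000 */
        MPI_Start(&req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);        /* the reported aborts point here */
    }

    MPI_Request_free(&req);
    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}
```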

Ask Aurora users if anyone is using persistent collectives. If not, deprioritize persistent collective testing and focus on the most used operations.

raffenet · Sep 24 '25 19:09

More examples of alltoall failures in persistent mode:

mpiexec --np 3072 --ppn 96 --cpu-bind verbose,list:1:2:3:4:5:6:7:8:9:10:11:12:13:14:15:16:18:19:20:21:22:23:24:25:26:27:28:29:30:31:32:33:35:36:37:38:39:40:41:42:43:44:45:46:47:48:49:50:53:54:55:56:57:58:59:60:61:62:63:64:65:66:67:68:70:71:72:73:74:75:76:77:78:79:80:81:82:83:84:85:87:88:89:90:91:92:93:94:95:96:97:98:99:100:101:102 /lus/flare/projects/Aurora_testing/mpi/osu_rfm/run_collective/32/alltoall-alltoall_persistent-alltoallv-alltoallv_persistent-alltoallw-alltoallw_persistent/stage/2025-09-14_11-29-57/aurora/compute/PrgEnv-intel/BuildMPIcollective_93bceebc/binaries/osu_alltoall_persistent -m 1024:1024 -i 1000 -x 100 -f -z

module load mpich-config/collective-tuning/1024 ; MPIR_CVAR_CH4_PROGRESS_THROTTLE=1 ; mpiexec --np 3072 --ppn 96 --cpu-bind verbose,list:1:2:3:4:5:6:7:8:9:10:11:12:13:14:15:16:18:19:20:21:22:23:24:25:26:27:28:29:30:31:32:33:35:36:37:38:39:40:41:42:43:44:45:46:47:48:49:50:53:54:55:56:57:58:59:60:61:62:63:64:65:66:67:68:70:71:72:73:74:75:76:77:78:79:80:81:82:83:84:85:87:88:89:90:91:92:93:94:95:96:97:98:99:100:101:102 /lus/flare/projects/Aurora_testing/mpi/osu_rfm/run_collective/32/alltoall-alltoall_persistent-alltoallv-alltoallv_persistent-alltoallw-alltoallw_persistent/stage/2025-09-14_11-29-57/aurora/compute/PrgEnv-intel/BuildMPIcollective_93bceebc/binaries/osu_alltoall_persistent -m 1024:1024 -i 1000 -x 100 -f -z ; unset MPIR_CVAR_CH4_PROGRESS_THROTTLE ; module unload mpich-config/collective-tuning/1024

mpiexec --np 12288 --ppn 96 --cpu-bind verbose,list:1:2:3:4:5:6:7:8:9:10:11:12:13:14:15:16:18:19:20:21:22:23:24:25:26:27:28:29:30:31:32:33:35:36:37:38:39:40:41:42:43:44:45:46:47:48:49:50:53:54:55:56:57:58:59:60:61:62:63:64:65:66:67:68:70:71:72:73:74:75:76:77:78:79:80:81:82:83:84:85:87:88:89:90:91:92:93:94:95:96:97:98:99:100:101:102 /lus/flare/projects/Aurora_testing/mpi/osu_rfm/run_collective/128/alltoall-alltoall_persistent-alltoallv-alltoallv_persistent-alltoallw-alltoallw_persistent/stage/2025-09-14_11-29-57/aurora/compute/PrgEnv-intel/BuildMPIcollective_93bceebc/binaries/osu_alltoall_persistent -m 1024:1024 -i 1000 -x 100 -f -z

module load mpich-config/collective-tuning/1024 ; MPIR_CVAR_CH4_PROGRESS_THROTTLE=1 ; mpiexec --np 12288 --ppn 96 --cpu-bind verbose,list:1:2:3:4:5:6:7:8:9:10:11:12:13:14:15:16:18:19:20:21:22:23:24:25:26:27:28:29:30:31:32:33:35:36:37:38:39:40:41:42:43:44:45:46:47:48:49:50:53:54:55:56:57:58:59:60:61:62:63:64:65:66:67:68:70:71:72:73:74:75:76:77:78:79:80:81:82:83:84:85:87:88:89:90:91:92:93:94:95:96:97:98:99:100:101:102 /lus/flare/projects/Aurora_testing/mpi/osu_rfm/run_collective/128/alltoall-alltoall_persistent-alltoallv-alltoallv_persistent-alltoallw-alltoallw_persistent/stage/2025-09-14_11-29-57/aurora/compute/PrgEnv-intel/BuildMPIcollective_93bceebc/binaries/osu_alltoall_persistent -m 1024:1024 -i 1000 -x 100 -f -z ; unset MPIR_CVAR_CH4_PROGRESS_THROTTLE ; module unload mpich-config/collective-tuning/1024

longfei-austin · Sep 29 '25 21:09