osu_igather Other MPI error starting at 8 nodes 96 ppn

Open longfei-austin opened this issue 3 months ago • 5 comments

Errors of the following signature have been encountered when running osu_igatherv and osu_igather:

Fatal error in internal_Wait: Other MPI error
Fatal error in internal_Barrier: Other MPI error
x4003c4s1b0n0.hsn.cm.aurora.alcf.anl.gov: rank 45283 died from signal 6
x4101c5s6b0n0.hsn.cm.aurora.alcf.anl.gov: rank 256 died from signal 11
x4101c5s6b0n0.hsn.cm.aurora.alcf.anl.gov: rank 257 died from signal 15

They are encountered more often in the next-eval queue and start to appear at 8 nodes.

longfei-austin avatar Sep 30 '25 16:09 longfei-austin

@longfei-austin 8 nodes at what PPN?

hzhou avatar Oct 07 '25 18:10 hzhou

I am using 8 nodes and the aurora_test branch, running osu_igather - it works for me all the way up to 96 ppn.

hzhou avatar Oct 07 '25 18:10 hzhou

From the next-eval queue results (8 nodes):

cd /lus/flare/projects/Aurora_testing/mpi/osu_rfm/run_collective/8/igather-igatherv/stage/2025-09-13_17-53-51/aurora/compute/PrgEnv-intel/RunMPIcollective

awk 'BEGIN{N=7} {if(prev~/Lat\(us\)/&&/Sun/){for(i=NR-N;i<NR;i++)if(i>0)print buffer[i%N];print $0;count=1} else if(count>0){print "\n\n";count--} buffer[NR%N]=$0; prev=$0}' rfm_job.out 
 mpiexec  --np 768 --ppn 96 --cpu-bind verbose,list:1:2:3:4:5:6:7:8:9:10:11:12:13:14:15:16:18:19:20:21:22:23:24:25:26:27:28:29:30:31:32:33:35:36:37:38:39:40:41:42:43:44:45:46:47:48:49:50:53:54:55:56:57:58:59:60:61:62:63:64:65:66:67:68:70:71:72:73:74:75:76:77:78:79:80:81:82:83:84:85:87:88:89:90:91:92:93:94:95:96:97:98:99:100:101:102 --gpu-bind verbose,list:0.0:0.0:0.0:0.0:0.0:0.0:0.0:0.0:0.1:0.1:0.1:0.1:0.1:0.1:0.1:0.1:1.0:1.0:1.0:1.0:1.0:1.0:1.0:1.0:1.1:1.1:1.1:1.1:1.1:1.1:1.1:1.1:2.0:2.0:2.0:2.0:2.0:2.0:2.0:2.0:2.1:2.1:2.1:2.1:2.1:2.1:2.1:2.1:3.0:3.0:3.0:3.0:3.0:3.0:3.0:3.0:3.1:3.1:3.1:3.1:3.1:3.1:3.1:3.1:4.0:4.0:4.0:4.0:4.0:4.0:4.0:4.0:4.1:4.1:4.1:4.1:4.1:4.1:4.1:4.1:5.0:5.0:5.0:5.0:5.0:5.0:5.0:5.0:5.1:5.1:5.1:5.1:5.1:5.1:5.1:5.1 /lus/flare/projects/Aurora_testing/mpi/osu_rfm/run_collective/8/igather-igatherv/stage/2025-09-13_17-53-51/aurora/compute/PrgEnv-intel/BuildMPIcollective_93bceebc/binaries/osu_igather -m 4096:4096 -i 1000 -x 100 -f -z -d sycl 

# OSU MPI-SYCL Non-blocking Gather Latency Test v7.5
# Overall = Coll. Init + Compute + MPI_Test + MPI_Wait

# Datatype: MPI_CHAR.
# Size           Overall(us)       Compute(us)    Coll. Init(us)      MPI_Test(us)      MPI_Wait(us)    Pure Comm.(us)     Min Comm.(us)     Max Comm.(us)        Overlap(%)  P50 Tail Lat(us)  P90 Tail Lat(us)  P99 Tail Lat(us)
Sun 14 Sep 2025 12:38:56 AM UTC



module load mpich-config/collective-tuning/1024 ; MPIR_CVAR_CH4_PROGRESS_THROTTLE=1 ; mpiexec  --np 768 --ppn 96 --cpu-bind verbose,list:1:2:3:4:5:6:7:8:9:10:11:12:13:14:15:16:18:19:20:21:22:23:24:25:26:27:28:29:30:31:32:33:35:36:37:38:39:40:41:42:43:44:45:46:47:48:49:50:53:54:55:56:57:58:59:60:61:62:63:64:65:66:67:68:70:71:72:73:74:75:76:77:78:79:80:81:82:83:84:85:87:88:89:90:91:92:93:94:95:96:97:98:99:100:101:102 --gpu-bind verbose,list:0.0:0.0:0.0:0.0:0.0:0.0:0.0:0.0:0.1:0.1:0.1:0.1:0.1:0.1:0.1:0.1:1.0:1.0:1.0:1.0:1.0:1.0:1.0:1.0:1.1:1.1:1.1:1.1:1.1:1.1:1.1:1.1:2.0:2.0:2.0:2.0:2.0:2.0:2.0:2.0:2.1:2.1:2.1:2.1:2.1:2.1:2.1:2.1:3.0:3.0:3.0:3.0:3.0:3.0:3.0:3.0:3.1:3.1:3.1:3.1:3.1:3.1:3.1:3.1:4.0:4.0:4.0:4.0:4.0:4.0:4.0:4.0:4.1:4.1:4.1:4.1:4.1:4.1:4.1:4.1:5.0:5.0:5.0:5.0:5.0:5.0:5.0:5.0:5.1:5.1:5.1:5.1:5.1:5.1:5.1:5.1 /lus/flare/projects/Aurora_testing/mpi/osu_rfm/run_collective/8/igather-igatherv/stage/2025-09-13_17-53-51/aurora/compute/PrgEnv-intel/BuildMPIcollective_93bceebc/binaries/osu_igather -m 4096:4096 -i 1000 -x 100 -f -z -d sycl ; unset MPIR_CVAR_CH4_PROGRESS_THROTTLE ; module unload mpich-config/collective-tuning/1024

# OSU MPI-SYCL Non-blocking Gather Latency Test v7.5
# Overall = Coll. Init + Compute + MPI_Test + MPI_Wait

# Datatype: MPI_CHAR.
# Size           Overall(us)       Compute(us)    Coll. Init(us)      MPI_Test(us)      MPI_Wait(us)    Pure Comm.(us)     Min Comm.(us)     Max Comm.(us)        Overlap(%)  P50 Tail Lat(us)  P90 Tail Lat(us)  P99 Tail Lat(us)
Sun 14 Sep 2025 12:39:02 AM UTC
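
For readers who find the awk one-liner above hard to parse: as I read it, it keeps a rolling buffer of the last 7 lines and prints them whenever a column-header line (one containing `Lat(us)`) is followed directly by a timestamp line, i.e. the benchmark printed its header but no result rows. Below is a rough Python sketch of the same filter. The function name `runs_without_data` is made up for illustration, and matching the timestamp on the literal `Sun` (as the awk does) only works for logs written on a Sunday.

```python
import re
from collections import deque

# Patterns copied from the awk one-liner.
HEADER = re.compile(r"Lat\(us\)")  # last column header of the OSU table
STAMP = re.compile(r"Sun")         # assumption: timestamp lines start with the weekday "Sun"


def runs_without_data(lines, context=7):
    """Return one block per run whose header line is directly followed by a timestamp.

    Each block holds up to `context` lines preceding the timestamp
    (command line plus benchmark header) followed by the timestamp itself.
    """
    recent = deque(maxlen=context)  # rolling window, like buffer[NR % N] in awk
    blocks = []
    prev = None
    for line in lines:
        if prev is not None and HEADER.search(prev) and STAMP.search(line):
            blocks.append(list(recent) + [line])
        recent.append(line)
        prev = line
    return blocks
```

Called on the lines of `rfm_job.out` (for example `open("rfm_job.out").read().splitlines()`), each returned block should correspond to one of the command-plus-header groups pasted in this comment. Unlike the awk version, the sketch does not print blank separator lines between blocks.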

longfei-austin avatar Oct 09 '25 18:10 longfei-austin

From the next-eval queue results (32 nodes):

cd /lus/flare/projects/Aurora_testing/mpi/osu_rfm/run_collective/32/igather-igatherv/stage/2025-09-13_17-53-50/aurora/compute/PrgEnv-intel/RunMPIcollective
awk 'BEGIN{N=7} {if(prev~/Lat\(us\)/&&/Sun/){for(i=NR-N;i<NR;i++)if(i>0)print buffer[i%N];print $0;count=1} else if(count>0){print "\n\n";count--} buffer[NR%N]=$0; prev=$0}' rfm_job.out 
 mpiexec  --np 3072 --ppn 96 --cpu-bind verbose,list:1:2:3:4:5:6:7:8:9:10:11:12:13:14:15:16:18:19:20:21:22:23:24:25:26:27:28:29:30:31:32:33:35:36:37:38:39:40:41:42:43:44:45:46:47:48:49:50:53:54:55:56:57:58:59:60:61:62:63:64:65:66:67:68:70:71:72:73:74:75:76:77:78:79:80:81:82:83:84:85:87:88:89:90:91:92:93:94:95:96:97:98:99:100:101:102 --gpu-bind verbose,list:0.0:0.0:0.0:0.0:0.0:0.0:0.0:0.0:0.1:0.1:0.1:0.1:0.1:0.1:0.1:0.1:1.0:1.0:1.0:1.0:1.0:1.0:1.0:1.0:1.1:1.1:1.1:1.1:1.1:1.1:1.1:1.1:2.0:2.0:2.0:2.0:2.0:2.0:2.0:2.0:2.1:2.1:2.1:2.1:2.1:2.1:2.1:2.1:3.0:3.0:3.0:3.0:3.0:3.0:3.0:3.0:3.1:3.1:3.1:3.1:3.1:3.1:3.1:3.1:4.0:4.0:4.0:4.0:4.0:4.0:4.0:4.0:4.1:4.1:4.1:4.1:4.1:4.1:4.1:4.1:5.0:5.0:5.0:5.0:5.0:5.0:5.0:5.0:5.1:5.1:5.1:5.1:5.1:5.1:5.1:5.1 /lus/flare/projects/Aurora_testing/mpi/osu_rfm/run_collective/32/igather-igatherv/stage/2025-09-13_17-53-50/aurora/compute/PrgEnv-intel/BuildMPIcollective_93bceebc/binaries/osu_igather -m 1024:1024 -i 1000 -x 100 -f -z -d sycl 

# OSU MPI-SYCL Non-blocking Gather Latency Test v7.5
# Overall = Coll. Init + Compute + MPI_Test + MPI_Wait

# Datatype: MPI_CHAR.
# Size           Overall(us)       Compute(us)    Coll. Init(us)      MPI_Test(us)      MPI_Wait(us)    Pure Comm.(us)     Min Comm.(us)     Max Comm.(us)        Overlap(%)  P50 Tail Lat(us)  P90 Tail Lat(us)  P99 Tail Lat(us)
Sun 14 Sep 2025 01:42:46 AM UTC



module load mpich-config/collective-tuning/1024 ; MPIR_CVAR_CH4_PROGRESS_THROTTLE=1 ; mpiexec  --np 3072 --ppn 96 --cpu-bind verbose,list:1:2:3:4:5:6:7:8:9:10:11:12:13:14:15:16:18:19:20:21:22:23:24:25:26:27:28:29:30:31:32:33:35:36:37:38:39:40:41:42:43:44:45:46:47:48:49:50:53:54:55:56:57:58:59:60:61:62:63:64:65:66:67:68:70:71:72:73:74:75:76:77:78:79:80:81:82:83:84:85:87:88:89:90:91:92:93:94:95:96:97:98:99:100:101:102 --gpu-bind verbose,list:0.0:0.0:0.0:0.0:0.0:0.0:0.0:0.0:0.1:0.1:0.1:0.1:0.1:0.1:0.1:0.1:1.0:1.0:1.0:1.0:1.0:1.0:1.0:1.0:1.1:1.1:1.1:1.1:1.1:1.1:1.1:1.1:2.0:2.0:2.0:2.0:2.0:2.0:2.0:2.0:2.1:2.1:2.1:2.1:2.1:2.1:2.1:2.1:3.0:3.0:3.0:3.0:3.0:3.0:3.0:3.0:3.1:3.1:3.1:3.1:3.1:3.1:3.1:3.1:4.0:4.0:4.0:4.0:4.0:4.0:4.0:4.0:4.1:4.1:4.1:4.1:4.1:4.1:4.1:4.1:5.0:5.0:5.0:5.0:5.0:5.0:5.0:5.0:5.1:5.1:5.1:5.1:5.1:5.1:5.1:5.1 /lus/flare/projects/Aurora_testing/mpi/osu_rfm/run_collective/32/igather-igatherv/stage/2025-09-13_17-53-50/aurora/compute/PrgEnv-intel/BuildMPIcollective_93bceebc/binaries/osu_igather -m 1024:1024 -i 1000 -x 100 -f -z -d sycl ; unset MPIR_CVAR_CH4_PROGRESS_THROTTLE ; module unload mpich-config/collective-tuning/1024

# OSU MPI-SYCL Non-blocking Gather Latency Test v7.5
# Overall = Coll. Init + Compute + MPI_Test + MPI_Wait

# Datatype: MPI_CHAR.
# Size           Overall(us)       Compute(us)    Coll. Init(us)      MPI_Test(us)      MPI_Wait(us)    Pure Comm.(us)     Min Comm.(us)     Max Comm.(us)        Overlap(%)  P50 Tail Lat(us)  P90 Tail Lat(us)  P99 Tail Lat(us)
Sun 14 Sep 2025 01:42:53 AM UTC



 mpiexec  --np 3072 --ppn 96 --cpu-bind verbose,list:1:2:3:4:5:6:7:8:9:10:11:12:13:14:15:16:18:19:20:21:22:23:24:25:26:27:28:29:30:31:32:33:35:36:37:38:39:40:41:42:43:44:45:46:47:48:49:50:53:54:55:56:57:58:59:60:61:62:63:64:65:66:67:68:70:71:72:73:74:75:76:77:78:79:80:81:82:83:84:85:87:88:89:90:91:92:93:94:95:96:97:98:99:100:101:102 --gpu-bind verbose,list:0.0:0.0:0.0:0.0:0.0:0.0:0.0:0.0:0.1:0.1:0.1:0.1:0.1:0.1:0.1:0.1:1.0:1.0:1.0:1.0:1.0:1.0:1.0:1.0:1.1:1.1:1.1:1.1:1.1:1.1:1.1:1.1:2.0:2.0:2.0:2.0:2.0:2.0:2.0:2.0:2.1:2.1:2.1:2.1:2.1:2.1:2.1:2.1:3.0:3.0:3.0:3.0:3.0:3.0:3.0:3.0:3.1:3.1:3.1:3.1:3.1:3.1:3.1:3.1:4.0:4.0:4.0:4.0:4.0:4.0:4.0:4.0:4.1:4.1:4.1:4.1:4.1:4.1:4.1:4.1:5.0:5.0:5.0:5.0:5.0:5.0:5.0:5.0:5.1:5.1:5.1:5.1:5.1:5.1:5.1:5.1 /lus/flare/projects/Aurora_testing/mpi/osu_rfm/run_collective/32/igather-igatherv/stage/2025-09-13_17-53-50/aurora/compute/PrgEnv-intel/BuildMPIcollective_93bceebc/binaries/osu_igather -m 4096:4096 -i 1000 -x 100 -f -z -d sycl 

# OSU MPI-SYCL Non-blocking Gather Latency Test v7.5
# Overall = Coll. Init + Compute + MPI_Test + MPI_Wait

# Datatype: MPI_CHAR.
# Size           Overall(us)       Compute(us)    Coll. Init(us)      MPI_Test(us)      MPI_Wait(us)    Pure Comm.(us)     Min Comm.(us)     Max Comm.(us)        Overlap(%)  P50 Tail Lat(us)  P90 Tail Lat(us)  P99 Tail Lat(us)
Sun 14 Sep 2025 01:43:00 AM UTC



module load mpich-config/collective-tuning/1024 ; MPIR_CVAR_CH4_PROGRESS_THROTTLE=1 ; mpiexec  --np 3072 --ppn 96 --cpu-bind verbose,list:1:2:3:4:5:6:7:8:9:10:11:12:13:14:15:16:18:19:20:21:22:23:24:25:26:27:28:29:30:31:32:33:35:36:37:38:39:40:41:42:43:44:45:46:47:48:49:50:53:54:55:56:57:58:59:60:61:62:63:64:65:66:67:68:70:71:72:73:74:75:76:77:78:79:80:81:82:83:84:85:87:88:89:90:91:92:93:94:95:96:97:98:99:100:101:102 --gpu-bind verbose,list:0.0:0.0:0.0:0.0:0.0:0.0:0.0:0.0:0.1:0.1:0.1:0.1:0.1:0.1:0.1:0.1:1.0:1.0:1.0:1.0:1.0:1.0:1.0:1.0:1.1:1.1:1.1:1.1:1.1:1.1:1.1:1.1:2.0:2.0:2.0:2.0:2.0:2.0:2.0:2.0:2.1:2.1:2.1:2.1:2.1:2.1:2.1:2.1:3.0:3.0:3.0:3.0:3.0:3.0:3.0:3.0:3.1:3.1:3.1:3.1:3.1:3.1:3.1:3.1:4.0:4.0:4.0:4.0:4.0:4.0:4.0:4.0:4.1:4.1:4.1:4.1:4.1:4.1:4.1:4.1:5.0:5.0:5.0:5.0:5.0:5.0:5.0:5.0:5.1:5.1:5.1:5.1:5.1:5.1:5.1:5.1 /lus/flare/projects/Aurora_testing/mpi/osu_rfm/run_collective/32/igather-igatherv/stage/2025-09-13_17-53-50/aurora/compute/PrgEnv-intel/BuildMPIcollective_93bceebc/binaries/osu_igather -m 4096:4096 -i 1000 -x 100 -f -z -d sycl ; unset MPIR_CVAR_CH4_PROGRESS_THROTTLE ; module unload mpich-config/collective-tuning/1024

# OSU MPI-SYCL Non-blocking Gather Latency Test v7.5
# Overall = Coll. Init + Compute + MPI_Test + MPI_Wait

# Datatype: MPI_CHAR.
# Size           Overall(us)       Compute(us)    Coll. Init(us)      MPI_Test(us)      MPI_Wait(us)    Pure Comm.(us)     Min Comm.(us)     Max Comm.(us)        Overlap(%)  P50 Tail Lat(us)  P90 Tail Lat(us)  P99 Tail Lat(us)
Sun 14 Sep 2025 01:43:07 AM UTC

longfei-austin avatar Oct 09 '25 18:10 longfei-austin

The MPICH version should be mpich/opt/5.0.0.aurora.250825.

longfei-austin avatar Oct 09 '25 19:10 longfei-austin