
osu_alltoall very slow at 128 nodes 96 ppn

longfei-austin opened this issue 3 months ago · 13 comments

Example on "current" image:

mpiexec  --np 12288 --ppn 96 --cpu-bind verbose,list:1:2:3:4:5:6:7:8:9:10:11:12:13:14:15:16:18:19:20:21:22:23:24:25:26:27:28:29:30:31:32:33:35:36:37:38:39:40:41:42:43:44:45:46:47:48:49:50:53:54:55:56:57:58:59:60:61:62:63:64:65:66:67:68:70:71:72:73:74:75:76:77:78:79:80:81:82:83:84:85:87:88:89:90:91:92:93:94:95:96:97:98:99:100:101:102                      \
      /lus/flare/projects/Aurora_testing/mpi/osu_rfm/run_collective/128/alltoall-alltoall_persistent-alltoallv-alltoallv_persistent-alltoallw-alltoallw_persistent/stage/2025-09-12_18-34-20/aurora/compute/PrgEnv-intel/BuildMPIcollective_93bceebc/binaries/osu_alltoallv \
    -m 8:8 -i 1000 -x 100 -f -z 


# OSU MPI All-to-Allv Personalized Exchange Latency Test v7.5
# Datatype: MPI_CHAR.
# Size       Avg Latency(us)   Min Latency(us)   Max Latency(us)  Iterations  P50 Tail Lat(us)  P90 Tail Lat(us)  P99 Tail Lat(us)
8                 2336272.24        2303640.97        2361927.28        1000        2233719.36        3639945.50        4924726.11

mpiexec  --np 49152 --ppn 96 --cpu-bind verbose,list:1:2:3:4:5:6:7:8:9:10:11:12:13:14:15:16:18:19:20:21:22:23:24:25:26:27:28:29:30:31:32:33:35:36:37:38:39:40:41:42:43:44:45:46:47:48:49:50:53:54:55:56:57:58:59:60:61:62:63:64:65:66:67:68:70:71:72:73:74:75:76:77:78:79:80:81:82:83:84:85:87:88:89:90:91:92:93:94:95:96:97:98:99:100:101:102                         \
    /lus/flare/projects/Aurora_testing/mpi/osu_rfm/run_collective/512/alltoall-alltoall_persistent-alltoallv-alltoallv_persistent-alltoallw-alltoallw_persistent/stage/2025-09-12_18-34-20/aurora/compute/PrgEnv-intel/BuildMPIcollective_93bceebc/binaries/osu_alltoall_persistent \
   -m 4096:4096 -i 1000 -x 100 -f -z 

Example on "next-eval" image:

module load mpich-config/collective-tuning/1024 ; MPIR_CVAR_CH4_PROGRESS_THROTTLE=1 ; mpiexec  --np 12288 --ppn 96 --cpu-bind verbose,list:1:2:3:4:5:6:7:8:9:10:11:12:13:14:15:16:18:19:20:21:22:23:24:25:26:27:28:29:30:31:32:33:35:36:37:38:39:40:41:42:43:44:45:46:47:48:49:50:53:54:55:56:57:58:59:60:61:62:63:64:65:66:67:68:70:71:72:73:74:75:76:77:78:79:80:81:82:83:84:85:87:88:89:90:91:92:93:94:95:96:97:98:99:100:101:102                         \
    /lus/flare/projects/Aurora_testing/mpi/osu_rfm/run_collective/128/alltoall-alltoall_persistent-alltoallv-alltoallv_persistent-alltoallw-alltoallw_persistent/stage/2025-09-13_12-00-10/aurora/compute/PrgEnv-intel/BuildMPIcollective_93bceebc/binaries/osu_alltoallv \
    -m 1024:1024 -i 1000 -x 100 -f -z ; unset MPIR_CVAR_CH4_PROGRESS_THROTTLE ; module unload mpich-config/collective-tuning/1024


# OSU MPI All-to-Allv Personalized Exchange Latency Test v7.5
# Datatype: MPI_CHAR.
# Size       Avg Latency(us)   Min Latency(us)   Max Latency(us)  Iterations  P50 Tail Lat(us)  P90 Tail Lat(us)  P99 Tail Lat(us)
1024               557998.66         557652.96         558416.66        1000         552487.75         579047.39         601034.67

mpiexec  --np 49152 --ppn 96 --cpu-bind verbose,list:1:2:3:4:5:6:7:8:9:10:11:12:13:14:15:16:18:19:20:21:22:23:24:25:26:27:28:29:30:31:32:33:35:36:37:38:39:40:41:42:43:44:45:46:47:48:49:50:53:54:55:56:57:58:59:60:61:62:63:64:65:66:67:68:70:71:72:73:74:75:76:77:78:79:80:81:82:83:84:85:87:88:89:90:91:92:93:94:95:96:97:98:99:100:101:102                        \
     /lus/flare/projects/Aurora_testing/mpi/osu_rfm/run_collective/512/alltoall-alltoall_persistent-alltoallv-alltoallv_persistent-alltoallw-alltoallw_persistent/stage/2025-09-13_12-00-09/aurora/compute/PrgEnv-intel/BuildMPIcollective_93bceebc/binaries/osu_alltoall_persistent \
    -m 4096:4096 -i 1000 -x 100 -f -z 

# OSU MPI All-to-All Personalized Exchange Persistent Latency Test v7.5
# Datatype: MPI_CHAR.
# Size       Avg Latency(us)   Min Latency(us)   Max Latency(us)  Iterations  P50 Tail Lat(us)  P90 Tail Lat(us)  P99 Tail Lat(us)
4096              4970324.00        4969234.36        4971451.83        1000        4564224.59        4741384.03       10566833.35
Sun 14 Sep 2025 03:12:24 AM UTC

longfei-austin · Sep 29 '25 21:09

@roblatham00 has been investigating MPI-IO collective aggregation performance as well, as have I; this is probably related. Is this only at high ppn, or are you seeing it at, say, 16 ppn? I have seen slow collective MPI-IO at just 16 ppn.

pkcoff · Sep 30 '25 01:09

I will get back to you on this; if I forget or am too slow, please yell ...

longfei-austin · Sep 30 '25 02:09

Mind taking a look yourself? (Looks pretty slow to me.) Let me know if you can't access the file.

cd /lus/flare/projects/Aurora_testing/mpi/osu_rfm/run_collective/512/alltoall-alltoall_persistent-alltoallv-alltoallv_persistent-alltoallw-alltoallw_persistent/stage/2025-09-13_12-00-09/aurora/compute/PrgEnv-intel/RunMPIcollective

cat rfm_job.out | grep -E "ppn>12" -C 7

longfei-austin · Sep 30 '25 04:09

I took a look. Taking the 4k message size and with the progress throttle, on 512 nodes the latency goes from 25 ms at 12 ppn to 327 ms at 96 ppn, so performance degrades linearly with the number of ranks, but I don't know if 25 ms at 12 ppn is good or bad. ROMIO uses alltoallv for exchanging arrays of offsets and lengths, so usually not a ton of data, then pt2pt for actually aggregating the data into the collective buffers.
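
For anyone not familiar with that pattern, the metadata exchange looks roughly like the following. This is an illustrative sketch only, not ROMIO's actual code (all names here are made up); the pt2pt aggregation of the file data onto the collective buffers is omitted.

```c
/* Sketch of the ROMIO-style metadata exchange described above: per-rank
 * counts via MPI_Alltoall, then the (offset, length) arrays via
 * MPI_Alltoallv. Illustrative only. */
#include <mpi.h>
#include <stdlib.h>

void exchange_offsets_lengths(MPI_Comm comm,
                              MPI_Offset *my_offsets, int *my_lengths,
                              int *send_counts /* pairs destined for each rank */)
{
    int nprocs;
    MPI_Comm_size(comm, &nprocs);

    /* 1. Tell every rank how many (offset, length) pairs to expect from us. */
    int *recv_counts = malloc(nprocs * sizeof(int));
    MPI_Alltoall(send_counts, 1, MPI_INT, recv_counts, 1, MPI_INT, comm);

    /* 2. Build displacements and exchange the metadata arrays themselves. */
    int *sdispls = malloc(nprocs * sizeof(int));
    int *rdispls = malloc(nprocs * sizeof(int));
    sdispls[0] = rdispls[0] = 0;
    for (int i = 1; i < nprocs; i++) {
        sdispls[i] = sdispls[i - 1] + send_counts[i - 1];
        rdispls[i] = rdispls[i - 1] + recv_counts[i - 1];
    }
    int total_recv = rdispls[nprocs - 1] + recv_counts[nprocs - 1];
    MPI_Offset *recv_offsets = malloc(total_recv * sizeof(MPI_Offset));
    int *recv_lengths = malloc(total_recv * sizeof(int));

    MPI_Alltoallv(my_offsets, send_counts, sdispls, MPI_OFFSET,
                  recv_offsets, recv_counts, rdispls, MPI_OFFSET, comm);
    MPI_Alltoallv(my_lengths, send_counts, sdispls, MPI_INT,
                  recv_lengths, recv_counts, rdispls, MPI_INT, comm);

    /* 3. ... pt2pt aggregation of the file data onto collective buffers ... */

    free(recv_counts); free(sdispls); free(rdispls);
    free(recv_offsets); free(recv_lengths);
}
```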

pkcoff · Sep 30 '25 05:09

lol, not trying to be pedantic, but that's actually worse than linear

longfei-austin · Sep 30 '25 12:09

Here are some alltoallv results (with a smaller node count):

cd /lus/flare/projects/Aurora_testing/mpi/osu_rfm/run_collective/32/alltoall-alltoall_persistent-alltoallv-alltoallv_persistent-alltoallw-alltoallw_persistent/stage/2025-09-14_11-29-57/aurora/compute/PrgEnv-intel/RunMPIcollective;

cat rfm_job.out | grep -E "cpu-binary>osu_alltoallv-ppn>12-message_length>4096" -A 7;

cat rfm_job.out | grep -E "gpu-binary>osu_alltoallv-ppn>12-message_length>4096" -A 7;

longfei-austin · Sep 30 '25 12:09

ref_collectives-node_count>32-target>cpu-binary>osu_alltoallv-ppn>12-message_length>4096-tune_level>0
mpiexec  --np 384 --ppn 12 --cpu-bind verbose,list:4:5:17:18:30:31:56:57:69:70:82:83                         /lus/flare/projects/Aurora_testing/mpi/osu_rfm/run_collective/32/alltoall-alltoall_persistent-alltoallv-alltoallv_persistent-alltoallw-alltoallw_persistent/stage/2025-09-14_11-29-57/aurora/compute/PrgEnv-intel/BuildMPIcollective_93bceebc/binaries/osu_alltoallv -m 4096:4096 -i 1000 -x 100 -f -z

# OSU MPI All-to-Allv Personalized Exchange Latency Test v7.5
# Datatype: MPI_CHAR.
# Size       Avg Latency(us)   Min Latency(us)   Max Latency(us)  Iterations  P50 Tail Lat(us)  P90 Tail Lat(us)  P99 Tail Lat(us)
4096                  467.38            460.43            474.05        1000            407.18            584.22            634.30
Sun 14 Sep 2025 04:55:38 PM UTC
--
ref_collectives-node_count>32-target>cpu-binary>osu_alltoallv-ppn>12-message_length>4096-tune_level>1
module load mpich-config/collective-tuning/1024 ; MPIR_CVAR_CH4_PROGRESS_THROTTLE=1 ; mpiexec  --np 384 --ppn 12 --cpu-bind verbose,list:4:5:17:18:30:31:56:57:69:70:82:83                         /lus/flare/projects/Aurora_testing/mpi/osu_rfm/run_collective/32/alltoall-alltoall_persistent-alltoallv-alltoallv_persistent-alltoallw-alltoallw_persistent/stage/2025-09-14_11-29-57/aurora/compute/PrgEnv-intel/BuildMPIcollective_93bceebc/binaries/osu_alltoallv -m 4096:4096 -i 1000 -x 100 -f -z ; unset MPIR_CVAR_CH4_PROGRESS_THROTTLE ; module unload mpich-config/collective-tuning/1024

# OSU MPI All-to-Allv Personalized Exchange Latency Test v7.5
# Datatype: MPI_CHAR.
# Size       Avg Latency(us)   Min Latency(us)   Max Latency(us)  Iterations  P50 Tail Lat(us)  P90 Tail Lat(us)  P99 Tail Lat(us)
4096                  469.69            461.00            479.75        1000            432.49            584.41            636.54
Sun 14 Sep 2025 04:55:41 PM UTC

longfei-austin · Sep 30 '25 12:09

ref_collectives-node_count>32-target>gpu-binary>osu_alltoallv-ppn>12-message_length>4096-tune_level>0
 mpiexec  --np 384 --ppn 12 --cpu-bind verbose,list:4:5:17:18:30:31:56:57:69:70:82:83 --gpu-bind verbose,list:0.0:0.1:1.0:1.1:2.0:2.1:3.0:3.1:4.0:4.1:5.0:5.1 /lus/flare/projects/Aurora_testing/mpi/osu_rfm/run_collective/32/alltoall-alltoall_persistent-alltoallv-alltoallv_persistent-alltoallw-alltoallw_persistent/stage/2025-09-14_11-29-57/aurora/compute/PrgEnv-intel/BuildMPIcollective_93bceebc/binaries/osu_alltoallv -m 4096:4096 -i 1000 -x 100 -f -z -d sycl 

# OSU MPI-SYCL All-to-Allv Personalized Exchange Latency Test v7.5
# Datatype: MPI_CHAR.
# Size       Avg Latency(us)   Min Latency(us)   Max Latency(us)  Iterations  P50 Tail Lat(us)  P90 Tail Lat(us)  P99 Tail Lat(us)
4096                 2424.84           2388.75           2486.50        1000           2422.49           2428.42           2512.23
Sun 14 Sep 2025 10:30:41 PM UTC
--
ref_collectives-node_count>32-target>gpu-binary>osu_alltoallv-ppn>12-message_length>4096-tune_level>1
module load mpich-config/collective-tuning/1024 ; MPIR_CVAR_CH4_PROGRESS_THROTTLE=1 ; mpiexec  --np 384 --ppn 12 --cpu-bind verbose,list:4:5:17:18:30:31:56:57:69:70:82:83 --gpu-bind verbose,list:0.0:0.1:1.0:1.1:2.0:2.1:3.0:3.1:4.0:4.1:5.0:5.1 /lus/flare/projects/Aurora_testing/mpi/osu_rfm/run_collective/32/alltoall-alltoall_persistent-alltoallv-alltoallv_persistent-alltoallw-alltoallw_persistent/stage/2025-09-14_11-29-57/aurora/compute/PrgEnv-intel/BuildMPIcollective_93bceebc/binaries/osu_alltoallv -m 4096:4096 -i 1000 -x 100 -f -z -d sycl ; unset MPIR_CVAR_CH4_PROGRESS_THROTTLE ; module unload mpich-config/collective-tuning/1024

# OSU MPI-SYCL All-to-Allv Personalized Exchange Latency Test v7.5
# Datatype: MPI_CHAR.
# Size       Avg Latency(us)   Min Latency(us)   Max Latency(us)  Iterations  P50 Tail Lat(us)  P90 Tail Lat(us)  P99 Tail Lat(us)
4096                 2430.45           2390.39           2496.55        1000           2426.75           2432.90           2559.86
Sun 14 Sep 2025 10:30:45 PM UTC

longfei-austin · Sep 30 '25 12:09

There is a sizable slowdown comparing GPU to CPU as well, but I guess that's less of a concern for now.

longfei-austin · Sep 30 '25 12:09

I have a version of IOR that, when run against DAOS in MPI-IO mode with collective buffering disabled, writes discontiguous data all over the file: essentially every rank writes a small chunk to every file domain, resulting in a lot of communication overhead. Running the write at 128 nodes, there is about a 10x latency slowdown at 16 ppn relative to the contiguous within-a-rank block write, and at 64 ppn this grows to about a 30x latency slowdown. This communication is all RMA from the CN clients to the DAOS servers via the DAOS messaging layer over CXI, so I have a feeling the issue with these collective calls is all part of the same issue within CXI. I can put together a reproducer if you would like.
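
In the meantime, the access pattern I'm describing looks roughly like this. It is a sketch, not the actual enhanced-IOR code; `chunk_size`, `ndomains`, and `domain_size` are illustrative parameters, and collective buffering is disabled via the standard ROMIO `romio_cb_write` hint.

```c
/* Rough sketch of the discontiguous write pattern described above (not the
 * actual enhanced-IOR code): collective buffering disabled, every rank
 * writing one small chunk into every file domain. */
#include <mpi.h>

void discontig_write(const char *path, const char *buf,
                     MPI_Offset chunk_size, int ndomains, MPI_Offset domain_size)
{
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "romio_cb_write", "disable");  /* no two-phase aggregation */

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, path,
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

    /* Each rank scatters one small chunk into each file domain. */
    for (int d = 0; d < ndomains; d++) {
        MPI_Offset off = (MPI_Offset)d * domain_size + (MPI_Offset)rank * chunk_size;
        MPI_File_write_at(fh, off, buf + d * chunk_size,
                          (int)chunk_size, MPI_CHAR, MPI_STATUS_IGNORE);
    }

    MPI_File_close(&fh);
    MPI_Info_free(&info);
}
```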

pkcoff · Oct 03 '25 20:10

A couple more details at the messaging level on my enhanced IOR slowdown. We are using erasure coding in DAOS with 128k cells, so in the regular contiguous-block IOR at 16 and 64 ppn we see 2.6 TB/second bandwidth, where the DAOS server does one 128k RMA get from a single client, which is then written to the SSD. Contrast this with my enhanced discontiguous-block IOR test: at 16 ppn the bandwidth drops to 260 GB/second, where there are 2 64k RMA gets from 2 clients into the one RMA buffer on the DAOS server, and at 64 ppn the bandwidth drops to about 90 GB/second, the only difference being that now there are 8 16K RMA gets from 8 clients into the DAOS server RMA buffer.

pkcoff · Oct 04 '25 14:10

Actually I was off by a factor of 2 on the enhanced IOR RMA gets: for 16 ppn the DAOS server does 4 32K RMA gets from 4 clients, and for 64 ppn it does 16 8K RMA gets from 16 clients.
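
(Sanity check on those numbers: both patterns still add up to the 128k cell, 4 x 32K = 128K and 16 x 8K = 128K; the only thing that changes with ppn is how many clients each cell's data is split across, assuming an even split.)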

pkcoff · Oct 04 '25 15:10

One more data point: with collective buffering ON for my enhanced IOR, on the read, where each collective buffer distributes data to all the ranks, there is a 100x slowdown compared to the regular IOR, where the collective buffers send all their data to just 1 rank.
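
For reference, the collective-buffering toggle I'm flipping on the read side is the usual ROMIO hint; a minimal sketch (illustrative, not the enhanced-IOR code):

```c
/* Minimal sketch of toggling collective buffering on the read path via the
 * standard ROMIO hint (not the actual enhanced-IOR code). */
#include <mpi.h>
#include <stdlib.h>

void cb_read_example(const char *path, int count)
{
    char *buf = malloc(count);

    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "romio_cb_read", "enable");   /* flip to "disable" to compare */

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, path, MPI_MODE_RDONLY, info, &fh);
    MPI_File_read_all(fh, buf, count, MPI_CHAR, MPI_STATUS_IGNORE);  /* collective read */
    MPI_File_close(&fh);

    MPI_Info_free(&info);
    free(buf);
}
```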

pkcoff · Oct 14 '25 21:10