
osu_alltoall very slow at 128 nodes 96 ppn

longfei-austin opened this issue 3 months ago · 13 comments

Example on "current" image:

mpiexec  --np 12288 --ppn 96 --cpu-bind verbose,list:1:2:3:4:5:6:7:8:9:10:11:12:13:14:15:16:18:19:20:21:22:23:24:25:26:27:28:29:30:31:32:33:35:36:37:38:39:40:41:42:43:44:45:46:47:48:49:50:53:54:55:56:57:58:59:60:61:62:63:64:65:66:67:68:70:71:72:73:74:75:76:77:78:79:80:81:82:83:84:85:87:88:89:90:91:92:93:94:95:96:97:98:99:100:101:102                      \
      /lus/flare/projects/Aurora_testing/mpi/osu_rfm/run_collective/128/alltoall-alltoall_persistent-alltoallv-alltoallv_persistent-alltoallw-alltoallw_persistent/stage/2025-09-12_18-34-20/aurora/compute/PrgEnv-intel/BuildMPIcollective_93bceebc/binaries/osu_alltoallv \
    -m 8:8 -i 1000 -x 100 -f -z 


# OSU MPI All-to-Allv Personalized Exchange Latency Test v7.5
# Datatype: MPI_CHAR.
# Size       Avg Latency(us)   Min Latency(us)   Max Latency(us)  Iterations  P50 Tail Lat(us)  P90 Tail Lat(us)  P99 Tail Lat(us)
8                 2336272.24        2303640.97        2361927.28        1000        2233719.36        3639945.50        4924726.11

mpiexec  --np 49152 --ppn 96 --cpu-bind verbose,list:1:2:3:4:5:6:7:8:9:10:11:12:13:14:15:16:18:19:20:21:22:23:24:25:26:27:28:29:30:31:32:33:35:36:37:38:39:40:41:42:43:44:45:46:47:48:49:50:53:54:55:56:57:58:59:60:61:62:63:64:65:66:67:68:70:71:72:73:74:75:76:77:78:79:80:81:82:83:84:85:87:88:89:90:91:92:93:94:95:96:97:98:99:100:101:102                         \
    /lus/flare/projects/Aurora_testing/mpi/osu_rfm/run_collective/512/alltoall-alltoall_persistent-alltoallv-alltoallv_persistent-alltoallw-alltoallw_persistent/stage/2025-09-12_18-34-20/aurora/compute/PrgEnv-intel/BuildMPIcollective_93bceebc/binaries/osu_alltoall_persistent \
   -m 4096:4096 -i 1000 -x 100 -f -z 

Example on "next-eval" image:

module load mpich-config/collective-tuning/1024 ; MPIR_CVAR_CH4_PROGRESS_THROTTLE=1 ; mpiexec  --np 12288 --ppn 96 --cpu-bind verbose,list:1:2:3:4:5:6:7:8:9:10:11:12:13:14:15:16:18:19:20:21:22:23:24:25:26:27:28:29:30:31:32:33:35:36:37:38:39:40:41:42:43:44:45:46:47:48:49:50:53:54:55:56:57:58:59:60:61:62:63:64:65:66:67:68:70:71:72:73:74:75:76:77:78:79:80:81:82:83:84:85:87:88:89:90:91:92:93:94:95:96:97:98:99:100:101:102                         \
    /lus/flare/projects/Aurora_testing/mpi/osu_rfm/run_collective/128/alltoall-alltoall_persistent-alltoallv-alltoallv_persistent-alltoallw-alltoallw_persistent/stage/2025-09-13_12-00-10/aurora/compute/PrgEnv-intel/BuildMPIcollective_93bceebc/binaries/osu_alltoallv \
    -m 1024:1024 -i 1000 -x 100 -f -z ; unset MPIR_CVAR_CH4_PROGRESS_THROTTLE ; module unload mpich-config/collective-tuning/1024


# OSU MPI All-to-Allv Personalized Exchange Latency Test v7.5
# Datatype: MPI_CHAR.
# Size       Avg Latency(us)   Min Latency(us)   Max Latency(us)  Iterations  P50 Tail Lat(us)  P90 Tail Lat(us)  P99 Tail Lat(us)
1024               557998.66         557652.96         558416.66        1000         552487.75         579047.39         601034.67

mpiexec  --np 49152 --ppn 96 --cpu-bind verbose,list:1:2:3:4:5:6:7:8:9:10:11:12:13:14:15:16:18:19:20:21:22:23:24:25:26:27:28:29:30:31:32:33:35:36:37:38:39:40:41:42:43:44:45:46:47:48:49:50:53:54:55:56:57:58:59:60:61:62:63:64:65:66:67:68:70:71:72:73:74:75:76:77:78:79:80:81:82:83:84:85:87:88:89:90:91:92:93:94:95:96:97:98:99:100:101:102                        \
     /lus/flare/projects/Aurora_testing/mpi/osu_rfm/run_collective/512/alltoall-alltoall_persistent-alltoallv-alltoallv_persistent-alltoallw-alltoallw_persistent/stage/2025-09-13_12-00-09/aurora/compute/PrgEnv-intel/BuildMPIcollective_93bceebc/binaries/osu_alltoall_persistent \
    -m 4096:4096 -i 1000 -x 100 -f -z 

# OSU MPI All-to-All Personalized Exchange Persistent Latency Test v7.5
# Datatype: MPI_CHAR.
# Size       Avg Latency(us)   Min Latency(us)   Max Latency(us)  Iterations  P50 Tail Lat(us)  P90 Tail Lat(us)  P99 Tail Lat(us)
4096              4970324.00        4969234.36        4971451.83        1000        4564224.59        4741384.03       10566833.35
Sun 14 Sep 2025 03:12:24 AM UTC

longfei-austin · Sep 29 '25 21:09

@roblatham00 has been investigating MPI-IO collective aggregation performance as well, as have I; this is probably related. Is this only at high ppn, or are you seeing it at, say, 16 ppn? I have seen slow collective MPI-IO at just 16 ppn.

pkcoff · Sep 30 '25 01:09

I will get back to you on this; if I forget or am too slow, please yell ...

longfei-austin · Sep 30 '25 02:09

Mind taking a look yourself? (Looks pretty slow to me.) Let me know if you can't access the file.

cd /lus/flare/projects/Aurora_testing/mpi/osu_rfm/run_collective/512/alltoall-alltoall_persistent-alltoallv-alltoallv_persistent-alltoallw-alltoallw_persistent/stage/2025-09-13_12-00-09/aurora/compute/PrgEnv-intel/RunMPIcollective

cat rfm_job.out | grep -E "ppn>12" -C 7

longfei-austin · Sep 30 '25 04:09

I took a look. Taking the 4k message size and with the progress throttle, on 512 nodes the latency goes from 25 ms at 12 ppn to 327 ms at 96 ppn, so performance degrades linearly with the number of ranks, but I don't know if 25 ms at 12 ppn is good or bad. ROMIO uses alltoallv for exchanging arrays of offsets and lengths, so usually not a ton of data, then pt2pt for actually aggregating the data into the collective buffers.
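
For anyone not familiar with that pattern, the metadata exchange looks roughly like the following. This is an illustrative sketch only, not ROMIO's actual code (all names here are made up); the pt2pt aggregation of the file data onto the collective buffers is omitted.

```c
/* Sketch of the ROMIO-style metadata exchange described above: per-rank
 * counts via MPI_Alltoall, then the (offset, length) arrays via
 * MPI_Alltoallv. Illustrative only. */
#include <mpi.h>
#include <stdlib.h>

void exchange_offsets_lengths(MPI_Comm comm,
                              MPI_Offset *my_offsets, int *my_lengths,
                              int *send_counts /* pairs destined for each rank */)
{
    int nprocs;
    MPI_Comm_size(comm, &nprocs);

    /* 1. Tell every rank how many (offset, length) pairs to expect from us. */
    int *recv_counts = malloc(nprocs * sizeof(int));
    MPI_Alltoall(send_counts, 1, MPI_INT, recv_counts, 1, MPI_INT, comm);

    /* 2. Build displacements and exchange the metadata arrays themselves. */
    int *sdispls = malloc(nprocs * sizeof(int));
    int *rdispls = malloc(nprocs * sizeof(int));
    sdispls[0] = rdispls[0] = 0;
    for (int i = 1; i < nprocs; i++) {
        sdispls[i] = sdispls[i - 1] + send_counts[i - 1];
        rdispls[i] = rdispls[i - 1] + recv_counts[i - 1];
    }
    int total_recv = rdispls[nprocs - 1] + recv_counts[nprocs - 1];
    MPI_Offset *recv_offsets = malloc(total_recv * sizeof(MPI_Offset));
    int *recv_lengths = malloc(total_recv * sizeof(int));

    MPI_Alltoallv(my_offsets, send_counts, sdispls, MPI_OFFSET,
                  recv_offsets, recv_counts, rdispls, MPI_OFFSET, comm);
    MPI_Alltoallv(my_lengths, send_counts, sdispls, MPI_INT,
                  recv_lengths, recv_counts, rdispls, MPI_INT, comm);

    /* 3. ... pt2pt aggregation of the file data onto collective buffers ... */

    free(recv_counts); free(sdispls); free(rdispls);
    free(recv_offsets); free(recv_lengths);
}
```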

pkcoff · Sep 30 '25 05:09

lol, not trying to be pedantic, but that's actually worse than linear

longfei-austin · Sep 30 '25 12:09

Here are some alltoallv results (with a smaller node count):

cd /lus/flare/projects/Aurora_testing/mpi/osu_rfm/run_collective/32/alltoall-alltoall_persistent-alltoallv-alltoallv_persistent-alltoallw-alltoallw_persistent/stage/2025-09-14_11-29-57/aurora/compute/PrgEnv-intel/RunMPIcollective;

cat rfm_job.out | grep -E "cpu-binary>osu_alltoallv-ppn>12-message_length>4096" -A 7;

cat rfm_job.out | grep -E "gpu-binary>osu_alltoallv-ppn>12-message_length>4096" -A 7;

longfei-austin · Sep 30 '25 12:09

ref_collectives-node_count>32-target>cpu-binary>osu_alltoallv-ppn>12-message_length>4096-tune_level>0
mpiexec  --np 384 --ppn 12 --cpu-bind verbose,list:4:5:17:18:30:31:56:57:69:70:82:83                         /lus/flare/projects/Aurora_testing/mpi/osu_rfm/run_collective/32/alltoall-alltoall_persistent-alltoallv-alltoallv_persistent-alltoallw-alltoallw_persistent/stage/2025-09-14_11-29-57/aurora/compute/PrgEnv-intel/BuildMPIcollective_93bceebc/binaries/osu_alltoallv -m 4096:4096 -i 1000 -x 100 -f -z

# OSU MPI All-to-Allv Personalized Exchange Latency Test v7.5
# Datatype: MPI_CHAR.
# Size       Avg Latency(us)   Min Latency(us)   Max Latency(us)  Iterations  P50 Tail Lat(us)  P90 Tail Lat(us)  P99 Tail Lat(us)
4096                  467.38            460.43            474.05        1000            407.18            584.22            634.30
Sun 14 Sep 2025 04:55:38 PM UTC
--
ref_collectives-node_count>32-target>cpu-binary>osu_alltoallv-ppn>12-message_length>4096-tune_level>1
module load mpich-config/collective-tuning/1024 ; MPIR_CVAR_CH4_PROGRESS_THROTTLE=1 ; mpiexec  --np 384 --ppn 12 --cpu-bind verbose,list:4:5:17:18:30:31:56:57:69:70:82:83                         /lus/flare/projects/Aurora_testing/mpi/osu_rfm/run_collective/32/alltoall-alltoall_persistent-alltoallv-alltoallv_persistent-alltoallw-alltoallw_persistent/stage/2025-09-14_11-29-57/aurora/compute/PrgEnv-intel/BuildMPIcollective_93bceebc/binaries/osu_alltoallv -m 4096:4096 -i 1000 -x 100 -f -z ; unset MPIR_CVAR_CH4_PROGRESS_THROTTLE ; module unload mpich-config/collective-tuning/1024

# OSU MPI All-to-Allv Personalized Exchange Latency Test v7.5
# Datatype: MPI_CHAR.
# Size       Avg Latency(us)   Min Latency(us)   Max Latency(us)  Iterations  P50 Tail Lat(us)  P90 Tail Lat(us)  P99 Tail Lat(us)
4096                  469.69            461.00            479.75        1000            432.49            584.41            636.54
Sun 14 Sep 2025 04:55:41 PM UTC

longfei-austin · Sep 30 '25 12:09

ref_collectives-node_count>32-target>gpu-binary>osu_alltoallv-ppn>12-message_length>4096-tune_level>0
 mpiexec  --np 384 --ppn 12 --cpu-bind verbose,list:4:5:17:18:30:31:56:57:69:70:82:83 --gpu-bind verbose,list:0.0:0.1:1.0:1.1:2.0:2.1:3.0:3.1:4.0:4.1:5.0:5.1 /lus/flare/projects/Aurora_testing/mpi/osu_rfm/run_collective/32/alltoall-alltoall_persistent-alltoallv-alltoallv_persistent-alltoallw-alltoallw_persistent/stage/2025-09-14_11-29-57/aurora/compute/PrgEnv-intel/BuildMPIcollective_93bceebc/binaries/osu_alltoallv -m 4096:4096 -i 1000 -x 100 -f -z -d sycl 

# OSU MPI-SYCL All-to-Allv Personalized Exchange Latency Test v7.5
# Datatype: MPI_CHAR.
# Size       Avg Latency(us)   Min Latency(us)   Max Latency(us)  Iterations  P50 Tail Lat(us)  P90 Tail Lat(us)  P99 Tail Lat(us)
4096                 2424.84           2388.75           2486.50        1000           2422.49           2428.42           2512.23
Sun 14 Sep 2025 10:30:41 PM UTC
--
ref_collectives-node_count>32-target>gpu-binary>osu_alltoallv-ppn>12-message_length>4096-tune_level>1
module load mpich-config/collective-tuning/1024 ; MPIR_CVAR_CH4_PROGRESS_THROTTLE=1 ; mpiexec  --np 384 --ppn 12 --cpu-bind verbose,list:4:5:17:18:30:31:56:57:69:70:82:83 --gpu-bind verbose,list:0.0:0.1:1.0:1.1:2.0:2.1:3.0:3.1:4.0:4.1:5.0:5.1 /lus/flare/projects/Aurora_testing/mpi/osu_rfm/run_collective/32/alltoall-alltoall_persistent-alltoallv-alltoallv_persistent-alltoallw-alltoallw_persistent/stage/2025-09-14_11-29-57/aurora/compute/PrgEnv-intel/BuildMPIcollective_93bceebc/binaries/osu_alltoallv -m 4096:4096 -i 1000 -x 100 -f -z -d sycl ; unset MPIR_CVAR_CH4_PROGRESS_THROTTLE ; module unload mpich-config/collective-tuning/1024

# OSU MPI-SYCL All-to-Allv Personalized Exchange Latency Test v7.5
# Datatype: MPI_CHAR.
# Size       Avg Latency(us)   Min Latency(us)   Max Latency(us)  Iterations  P50 Tail Lat(us)  P90 Tail Lat(us)  P99 Tail Lat(us)
4096                 2430.45           2390.39           2496.55        1000           2426.75           2432.90           2559.86
Sun 14 Sep 2025 10:30:45 PM UTC

longfei-austin · Sep 30 '25 12:09

There is a sizable slowdown comparing GPU to CPU as well, but I guess that's less of a concern for now.

longfei-austin · Sep 30 '25 12:09

I have a version of IOR that, when run against DAOS in MPI-IO mode with collective buffering disabled, writes discontiguous data all over the file: essentially every rank writes a small chunk to every file domain, resulting in a lot of communication overhead. Running the write at 128 nodes, there is about a 10x latency slowdown at 16 ppn relative to the contiguous within-a-rank block write, and at 64 ppn this grows to about a 30x latency slowdown. This communication is all RMA from the CN clients to the DAOS servers via the DAOS messaging layer over CXI, so I have a feeling the issue with these collective calls is all part of the same issue within CXI. I can put together a reproducer if you would like.
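
In the meantime, the access pattern I'm describing looks roughly like this. It is a sketch, not the actual enhanced-IOR code; `chunk_size`, `ndomains`, and `domain_size` are illustrative parameters, and collective buffering is disabled via the standard ROMIO `romio_cb_write` hint.

```c
/* Rough sketch of the discontiguous write pattern described above (not the
 * actual enhanced-IOR code): collective buffering disabled, every rank
 * writing one small chunk into every file domain. */
#include <mpi.h>

void discontig_write(const char *path, const char *buf,
                     MPI_Offset chunk_size, int ndomains, MPI_Offset domain_size)
{
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "romio_cb_write", "disable");  /* no two-phase aggregation */

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, path,
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

    /* Each rank scatters one small chunk into each file domain. */
    for (int d = 0; d < ndomains; d++) {
        MPI_Offset off = (MPI_Offset)d * domain_size + (MPI_Offset)rank * chunk_size;
        MPI_File_write_at(fh, off, buf + d * chunk_size,
                          (int)chunk_size, MPI_CHAR, MPI_STATUS_IGNORE);
    }

    MPI_File_close(&fh);
    MPI_Info_free(&info);
}
```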

pkcoff · Oct 03 '25 20:10

A couple more details at the messaging level on my enhanced IOR slowdown. We are using erasure coding in DAOS with 128k cells, so in the regular contiguous-block IOR at 16 and 64 ppn we see 2.6 TB/second bandwidth, where the DAOS server does one 128k RMA get from a single client, which is then written to the SSD. Contrast this with my enhanced discontiguous-block IOR test: at 16 ppn the bandwidth drops to 260 GB/second, where there are 2 64k RMA gets from 2 clients into the one RMA buffer on the DAOS server, and at 64 ppn the bandwidth drops to about 90 GB/second, the only difference being that now there are 8 16K RMA gets from 8 clients into the DAOS server RMA buffer.

pkcoff · Oct 04 '25 14:10

Actually I was off by a factor of 2 on the enhanced IOR RMA gets: for 16 ppn the DAOS server does 4 32K RMA gets from 4 clients, and for 64 ppn it does 16 8K RMA gets from 16 clients.
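
(Sanity check on those numbers: both patterns still add up to the 128k cell, 4 x 32K = 128K and 16 x 8K = 128K; the only thing that changes with ppn is how many clients each cell's data is split across, assuming an even split.)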

pkcoff · Oct 04 '25 15:10

One more data point: with collective buffering ON for my enhanced IOR, on the read, where each collective buffer distributes data to all the ranks, there is a 100x slowdown compared to the regular IOR, where the collective buffers send all their data to just 1 rank.
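
For reference, the collective-buffering toggle I'm flipping on the read side is the usual ROMIO hint; a minimal sketch (illustrative, not the enhanced-IOR code):

```c
/* Minimal sketch of toggling collective buffering on the read path via the
 * standard ROMIO hint (not the actual enhanced-IOR code). */
#include <mpi.h>
#include <stdlib.h>

void cb_read_example(const char *path, int count)
{
    char *buf = malloc(count);

    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "romio_cb_read", "enable");   /* flip to "disable" to compare */

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, path, MPI_MODE_RDONLY, info, &fh);
    MPI_File_read_all(fh, buf, count, MPI_CHAR, MPI_STATUS_IGNORE);  /* collective read */
    MPI_File_close(&fh);

    MPI_Info_free(&info);
    free(buf);
}
```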

pkcoff · Oct 14 '25 21:10