Longfei Gao
I will get back to you on this; if I forget or am too slow, please yell ...
Mind taking a look yourself? (Looks pretty slow to me.) Let me know if you can't access the file:
```
cd /lus/flare/projects/Aurora_testing/mpi/osu_rfm/run_collective/512/alltoall-alltoall_persistent-alltoallv-alltoallv_persistent-alltoallw-alltoallw_persistent/stage/2025-09-13_12-00-09/aurora/compute/PrgEnv-intel/RunMPIcollective
cat rfm_job.out | grep -E "ppn>12" -C 7
```
lol, not trying to be pedantic, that's actually worse than linear
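For a rough frame of reference (not measured data from these runs, just the usual alpha-beta estimate for a pairwise-exchange alltoall), the ideal latency already grows roughly linearly with rank count, so anything steeper than that is extra overhead. The alpha and beta values below are made-up placeholders:
```
# Back-of-the-envelope alpha-beta estimate for a pairwise-exchange alltoall.
# alpha/beta are placeholder values, not numbers measured on Aurora.
alpha = 2e-6       # assumed per-message latency, seconds
beta = 1 / 12e9    # assumed inverse bandwidth, seconds per byte
m = 4096           # per-pair message size used in the osu_alltoallv runs, bytes

for ranks in (384, 1536, 6144):            # e.g. 32, 128, 512 nodes at ppn=12
    t = (ranks - 1) * (alpha + beta * m)   # each rank exchanges with every other rank
    print(f"ranks={ranks:5d}  ideal ~= {t * 1e6:8.1f} us")
```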
Here are some alltoallv results (with a smaller node count):
```
cd /lus/flare/projects/Aurora_testing/mpi/osu_rfm/run_collective/32/alltoall-alltoall_persistent-alltoallv-alltoallv_persistent-alltoallw-alltoallw_persistent/stage/2025-09-14_11-29-57/aurora/compute/PrgEnv-intel/RunMPIcollective
cat rfm_job.out | grep -E "cpu-binary>osu_alltoallv-ppn>12-message_length>4096" -A 7
cat rfm_job.out | grep -E "gpu-binary>osu_alltoallv-ppn>12-message_length>4096" -A 7
```
```
ref_collectives-node_count>32-target>cpu-binary>osu_alltoallv-ppn>12-message_length>4096-tune_level>0
mpiexec --np 384 --ppn 12 --cpu-bind verbose,list:4:5:17:18:30:31:56:57:69:70:82:83 /lus/flare/projects/Aurora_testing/mpi/osu_rfm/run_collective/32/alltoall-alltoall_persistent-alltoallv-alltoallv_persistent-alltoallw-alltoallw_persistent/stage/2025-09-14_11-29-57/aurora/compute/PrgEnv-intel/BuildMPIcollective_93bceebc/binaries/osu_alltoallv -m 4096:4096 -i 1000 -x 100 -f -z
# OSU MPI All-to-Allv Personalized Exchange Latency Test v7.5
# Datatype: MPI_CHAR
...
```
```
ref_collectives-node_count>32-target>gpu-binary>osu_alltoallv-ppn>12-message_length>4096-tune_level>0
mpiexec --np 384 --ppn 12 --cpu-bind verbose,list:4:5:17:18:30:31:56:57:69:70:82:83 --gpu-bind verbose,list:0.0:0.1:1.0:1.1:2.0:2.1:3.0:3.1:4.0:4.1:5.0:5.1 /lus/flare/projects/Aurora_testing/mpi/osu_rfm/run_collective/32/alltoall-alltoall_persistent-alltoallv-alltoallv_persistent-alltoallw-alltoallw_persistent/stage/2025-09-14_11-29-57/aurora/compute/PrgEnv-intel/BuildMPIcollective_93bceebc/binaries/osu_alltoallv -m 4096:4096 -i 1000 -x 100 -f -z -d sycl
# OSU MPI-SYCL All-to-Allv Personalized Exchange Latency Test
...
```
A sizable slowdown going from cpu to gpu as well, but I guess that's less of a concern for now.
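If it helps, here is a rough sketch for pulling the 4096-byte latencies out of rfm_job.out so the cpu and gpu numbers sit side by side. The tag strings and the "size, avg latency" layout of the OSU output are assumptions based on the snippets above, not verified against the full file:
```
# Sketch: extract the 4 KiB avg latency that follows a ReFrame tag line in rfm_job.out.
# Tag strings and OSU output layout ("<size>  <avg latency us>") are assumed from the
# snippets pasted above.
import re

def latency_after_tag(path, tag, size=4096, window=10):
    lines = open(path).readlines()
    for i, line in enumerate(lines):
        if tag in line:
            # scan the next few lines for "<size>  <latency>"
            for follow in lines[i:i + window]:
                m = re.match(rf"\s*{size}\s+([\d.]+)", follow)
                if m:
                    return float(m.group(1))
    return None

cpu = latency_after_tag("rfm_job.out", "target>cpu-binary>osu_alltoallv-ppn>12-message_length>4096")
gpu = latency_after_tag("rfm_job.out", "target>gpu-binary>osu_alltoallv-ppn>12-message_length>4096")
if cpu and gpu:
    print(f"cpu: {cpu:.2f} us  gpu: {gpu:.2f} us  gpu/cpu: {gpu / cpu:.2f}x")
```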
On 8 nodes, 4 errors
On 32 nodes, 8 errors
On 128 nodes, 16 errors
On 512 nodes, 52 errors
So it might be related to the total number of ranks.
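Quick back-of-the-envelope check on whether the error count tracks node count or total rank count (ppn=12 below is just an assumption to make the rank counts concrete; swap in whatever these runs actually used):
```
# Error counts from the runs above; see whether they grow with node count or rank count.
# ppn is an assumption, not taken from the gather job scripts.
nodes = [8, 32, 128, 512]
errors = [4, 8, 16, 52]
ppn = 12

for n, e in zip(nodes, errors):
    ranks = n * ppn
    print(f"nodes={n:4d}  ranks={ranks:5d}  errors={e:3d}  "
          f"errors/node={e / n:.3f}  errors/rank={e / ranks:.5f}")
```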
```
cd /lus/flare/projects/Aurora_testing/mpi/osu_rfm/run_collective/8/gather-gather_persistent-gatherv-gatherv_persistent/stage/2025-09-13_12-50-22/aurora/compute/PrgEnv-intel/RunMPIcollective/
tail rfm_job.out
```
```
module load mpich-config/collective-tuning/1024
MPIR_CVAR_CH4_PROGRESS_THROTTLE=1
mpiexec --np 3072 --ppn 96 --cpu-bind verbose,list:1:2:3:4:5:6:7:8:9:10:11:12:13:14:15:16:18:19:20:21:22:23:24:25:26:27:28:29:30:31:32:33:35:36:37:38:39:40:41:42:43:44:45:46:47:48:49:50:53:54:55:56:57:58:59:60:61:62:63:64:65:66:67:68:70:71:72:73:74:75:76:77:78:79:80:81:82:83:84:85:87:88:89:90:91:92:93:94:95:96:97:98:99:100:101:102 /lus/flare/projects/Aurora_testing/mpi/osu_rfm/run_collective/32/gather-gather_persistent-gatherv-gatherv_persistent/stage/2025-09-14_21-51-44/aurora/compute/PrgEnv-intel/BuildMPIcollective_93bceebc/binaries/osu_gather -m 4096:4096 -i 1000 -x 100 -f -z
unset MPIR_CVAR_CH4_PROGRESS_THROTTLE
module unload mpich-config/collective-tuning/1024
#...
```