osu_allgather hang
The following command reliably hangs on the current default Aurora image.
for i in {1..100}; do module load mpich-config/collective-tuning/1024; mpiexec --np 24 --ppn 12 --cpu-bind verbose,list:4:5:17:18:30:31:56:57:69:70:82:83 --gpu-bind verbose,list:0.0:0.1:1.0:1.1:2.0:2.1:3.0:3.1:4.0:4.1:5.0:5.1 <path/to/osu_allgather/binary> -m 4096:4096 -i 1000 -x 100 -f -z -d sycl ; module unload mpich-config/collective-tuning/1024; echo $i; done
*** It no longer hangs on the next-eval image.
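To spot a hung iteration without watching the terminal, the loop can be wrapped in a time limit. The following is only a minimal sketch, assuming GNU coreutils `timeout` is available on the node. The helper name `run_until_hang` and the limit values are hypothetical and not part of the original report. How cleanly the remaining ranks are torn down after the limit expires depends on the launcher.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command repeatedly and stop at the first
# iteration that exceeds a time limit (treated here as a probable hang).
# Usage: run_until_hang <iterations> <limit_seconds> <command> [args...]
run_until_hang() {
    local iters=$1 limit=$2
    shift 2
    local i rc
    for ((i = 1; i <= iters; i++)); do
        timeout "$limit" "$@"
        rc=$?
        # GNU timeout exits with status 124 when the limit expired.
        if [ "$rc" -eq 124 ]; then
            echo "iteration $i: timed out after ${limit}s (probable hang)"
            return 1
        fi
        echo "iteration $i: exit code $rc"
    done
    return 0
}
```

With the tuning module loaded beforehand, the call would look like `run_until_hang 100 300 mpiexec --np 24 --ppn 12 ...`, followed by the same arguments as in the command above. The 300 second limit is an arbitrary assumption and should be set well above the normal run time.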
I reproduced this on Aurora. The tuning file forces all allgather calls through composition alpha, which uses a shm allgather followed by a multi-leader network allgather. When it hangs, I see that some processes are stuck in the network allgather while others have moved on to the following timing barrier.
While I don't yet have a root cause, a few thoughts:
- The multi-leader comms constructed in the default module are manually coded and could be error-prone. There was substantial rewriting of these types of subcomm creations in 70f4621bc87ab4f6bcdd9ac79ae95b55b4421b5b that may have fixed those errors.
- 70f4621bc87ab4f6bcdd9ac79ae95b55b4421b5b also had a typo that I ran into during testing. That should be fixed in https://github.com/pmodels/mpich/pull/7614.
I can't reproduce it. I tried both the latest aurora_test branch and an old aurora branch (last commit 12/21/2024). Neither hung, but I noticed that the latency has somehow improved significantly:
- latest aurora_test: 4096 - 40.93 us
- old aurora: 4096 - 112.50 us
EDIT:
I tried the aurora branch at its latest commit (07/02/2025). It didn't hang either, and the latency was 111.81 us.
Actually, I was not properly testing the old aurora branch, since that branch was built as libmpi.so.0.0.0 and the osu_gather was linked against mpich-develop-git.6037a7a-sxnhr7p/lib/libmpi.so.12 instead.
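A quick way to catch this kind of mismatch is to check which libmpi the loader actually resolves for the benchmark binary before comparing branches. This is a rough sketch, assuming a Linux node where `ldd` prints lines of the form `name => path (address)`. The helper name `resolved_lib` is hypothetical.

```shell
# Hypothetical helper: filter `ldd`-style output on stdin and print the
# entries whose library name starts with the given prefix, with leading
# whitespace and the trailing load address removed.
resolved_lib() {
    awk -v pat="$1" 'index($1, pat) == 1 {
        sub(/^[ \t]+/, "")
        sub(/[ \t]+\(0x[0-9a-fA-F]+\)$/, "")
        print
    }'
}

# Usage (the binary path is a placeholder, as in the original command):
#   ldd <path/to/osu_allgather/binary> | resolved_lib libmpi
```

If the printed path does not point into the install prefix of the branch under test, the run is exercising a different MPICH build than intended.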
EDIT2: The performance improvement likely comes from Mike's collective tuning adjustment.