
Alltoallv hanging at 64 nodes on Aurora with newer MPICH

Open colleeneb opened this issue 5 months ago • 13 comments

This is just to report an issue from Aditya Nishtala (@aditya-nishtala), all credit to them.

Running osu_alltoallv on 64 nodes at 1 KB message size will not complete even after waiting for an hour. This is an old issue that appeared back in January, disappeared in mid-May, and is now back. But it is only back in the next-eval queue, whose nodes run the 1146 Agama driver and MPICH from https://github.com/pmodels/mpich/tree/aurora-250825.

It doesn't happen on nodes in the production queue, which run the 1099 Agama driver and mpich/opt/develop-git.6037a7a.

To work around the issue you currently need to enable the MPICH progress throttle that was added in version 5.0.0a1, like so:

`export MPIR_CVAR_CH4_PROGRESS_THROTTLE=1`
`export MPIR_CVAR_CH4_PROGRESS_THROTTLE_NO_PROGRESS_COUNT=8192`

The last time this problem appeared, this workaround was built on the hypothesis that the CPU was polling too fast and too often, preventing the NICs from accessing memory. The workaround is currently needed when running on nodes in the next-eval queue.

Currently this issue affects LAMMPS_SOW walltime and is not known to impact performance elsewhere:

  • 128-node LAMMPS_SOW without workaround: 399.08 s
  • 128-node LAMMPS_SOW with workaround: 88.68 s
  • 128-node LAMMPS_SOW on 1099 Agama: 87.11 s

LAMMPS heavily uses alltoallv during setup: setting up the atoms, creating bonds, and building neighbor lists. Timing info on those alltoallv calls is in the comments below.

Context:

  • Current default: https://github.com/pmodels/mpich/commit/6037a7a7a3fe6b5a62b2896cd2894ef5dca6648f
  • next-eval prior to Sept 24: https://github.com/pmodels/mpich/tree/aurora-250825
  • next-eval after: 06f012a

colleeneb avatar Sep 24 '25 17:09 colleeneb

LAMMPS heavily uses alltoallv during setup: setting up the atoms, creating bonds, and building neighbor lists. Here is the timing info on those alltoallv calls.

The first table below is the time (in seconds) each alltoallv takes before the workaround at 64 nodes.

| Nodes | Total Replicate/Special Bonds Time | 1st AlltoAllv inside Special | 2nd AlltoAllv inside Special | 3rd AlltoAllv inside Special | 4th AlltoAllv inside Special | 5th AlltoAllv inside Special | 6th AlltoAllv inside Special | 7th AlltoAllv inside Special |
|---|---|---|---|---|---|---|---|---|
| 64 | 94.4057 | 7.33815 | 13.4984 | 12.2322 | 13.0263 | 12.6354 | 13.9459 | 12.0326 |

The second table below is the time (in seconds) each alltoallv takes after the workaround at 64 nodes.

| Nodes | Total Replicate/Special Bonds Time | 1st AlltoAllv inside Special | 2nd AlltoAllv inside Special | 3rd AlltoAllv inside Special | 4th AlltoAllv inside Special | 5th AlltoAllv inside Special | 6th AlltoAllv inside Special | 7th AlltoAllv inside Special |
|---|---|---|---|---|---|---|---|---|
| 64 | 0.994117 | 0.196668 | 0.0741985 | 0.0173872 | 0.02093 | 0.0195717 | 0.0614534 | 0.0709564 |
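As a quick sanity check on the two tables, the before/after times can be compared directly. The numbers below are copied from the tables above; this is plain arithmetic, not part of the benchmark itself:

```python
# Alltoallv times (seconds) at 64 nodes, copied from the two tables above.
before = [7.33815, 13.4984, 12.2322, 13.0263, 12.6354, 13.9459, 12.0326]
after  = [0.196668, 0.0741985, 0.0173872, 0.02093, 0.0195717, 0.0614534, 0.0709564]

total_before = 94.4057   # Total Replicate/Special Bonds Time, no workaround
total_after  = 0.994117  # same, with the throttle workaround

overall_speedup = total_before / total_after
per_call_speedup = [b / a for b, a in zip(before, after)]

print(f"overall speedup: {overall_speedup:.1f}x")
for i, s in enumerate(per_call_speedup, 1):
    print(f"alltoallv #{i}: {s:.0f}x faster with the workaround")
```

The workaround cuts the total Replicate/Special Bonds time by roughly 95x, with every individual alltoallv call faster.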

aditya-nishtala avatar Sep 24 '25 17:09 aditya-nishtala

Possibly relevant from a previous thread on Slack:

> https://github.com/pmodels/mpich/commit/c07aabfa584ca63220ea538432045903b4f916df
>
> The ANL MPICH team also implemented a new algorithm for alltoallv which can be tested via the env from this commit: https://github.com/pmodels/mpich/commit/6037a7a7a3fe6b5a62b2896cd2894ef5dca6648f

colleeneb avatar Sep 24 '25 18:09 colleeneb

Try using MPIR_CVAR_ALLTOALLV_PAIRWISE_NEW=1 to see if the improved alltoallv algorithm reduces the need for the progress throttle.

raffenet avatar Sep 24 '25 19:09 raffenet

I tried out MPIR_CVAR_ALLTOALLV_PAIRWISE_NEW=1 on mpich/opt/5.0.0.aurora_test.06f012a; even after 40 minutes, the 64-node run at 1 KB message size did not complete.

After disabling nohz_full on 64 specific nodes, osu_alltoallv works fine on those nodes. However, on Sunspot, alltoallv was working fine even with nohz_full turned on.

aditya-nishtala avatar Sep 30 '25 16:09 aditya-nishtala

So what I found out is that on Sunspot there is no issue with alltoallv; it works just fine. On Aurora, and Aurora only, the issue pops up, even though Sunspot uses the same image as the Aurora next-eval queue. On Aurora, turning off nohz_full helps, but nohz_full is not the culprit.

Here is what I found in my last week of debugging. When alltoallv is working correctly:

Image

It looks like this ^, spending 70 to 80% of its execution time in kernel space.

When alltoallv is not working correctly:

Image

It looks like this ^, spending 99% of its execution time just spinning on a lock in user space.

Using the throttle workaround, for some reason, makes it work:

Image

This is how it looks with the throttle workaround ^: 50% user space and 50% system space.

Running alltoallv at 64 nodes with 1 KB message size for exactly 1 iteration, Aurora is able to match Sunspot's alltoallv performance, and both clusters stay consistent. The moment I go to 2 iterations on Aurora, the performance varies anywhere between hundreds of milliseconds and multiple seconds.

The data for 1-iteration runs is as follows (OSU MPI All-to-Allv Personalized Exchange Latency Test v5.6.2, avg latency in us at 1024 bytes):

  • 228141.01
  • 228840.02
  • 227735.06
  • 228491.49

The data for 2-iteration runs is as follows (same test, same size):

  • 649632.73
  • 983503.00
  • 861259.03

Or if I increase the message size from 1 KB to 2 KB and still keep 1 iteration, the performance once again varies between milliseconds and multiple seconds.
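To quantify the run-to-run variation in the numbers above (values copied from the OSU output; plain arithmetic only):

```python
from statistics import mean

# Avg latencies (us) at 1024 bytes, copied from the OSU runs above.
one_iter = [228141.01, 228840.02, 227735.06, 228491.49]
two_iter = [649632.73, 983503.00, 861259.03]

def spread(xs):
    """Max-min range across repeated runs."""
    return max(xs) - min(xs)

print(f"1 iteration:  mean {mean(one_iter):,.0f} us, spread {spread(one_iter):,.0f} us")
print(f"2 iterations: mean {mean(two_iter):,.0f} us, spread {spread(two_iter):,.0f} us")
```

The 1-iteration runs agree to within about 1.1 ms, while the 2-iteration runs differ by over 300 ms between the fastest and slowest run.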

There is some relation between total cluster size, time spent in kernel space, and total data transfer size (via iterations or larger message sizes) that triggers this slowdown in alltoallv, and only in alltoallv.

Apparently alltoallv specifically has very different behavior from the other collectives in terms of executing in user space vs. system space.
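The user-vs-kernel split observed in htop can also be measured programmatically. Here is a minimal sketch using Python's standard `os.times()` (this is an illustration of the measurement technique, not the actual LAMMPS/MPI instrumentation):

```python
import os

def user_system_split(fn):
    """Run fn() and report the fraction of CPU time in user vs. kernel space."""
    t0 = os.times()
    fn()
    t1 = os.times()
    user = t1.user - t0.user
    system = t1.system - t0.system
    total = user + system
    return (user / total, system / total) if total else (0.0, 0.0)

# A pure compute loop spends essentially all of its time in user space,
# similar to the broken alltoallv case above (99% user-space spinning),
# whereas the healthy case was 70-80% kernel time.
user_frac, sys_frac = user_system_split(lambda: sum(i * i for i in range(3_000_000)))
print(f"user: {user_frac:.0%}, system: {sys_frac:.0%}")
```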

aditya-nishtala avatar Oct 06 '25 20:10 aditya-nishtala

Forgot to add: currently on Aurora next-eval there are 2 MPICH versions available, mpich/opt/5.0.0.aurora_test.06f012a (loaded by default) and mpich/opt/4.3.1.

Right now LAMMPS SOW only retains its performance and hits its performance target with 4.3.1, which doesn't have the throttle option available.

aditya-nishtala avatar Oct 06 '25 21:10 aditya-nishtala

Today we came across 64 "good" nodes that exhibit the alltoallv problem to a significantly lesser degree. The good nodes are: x4006c2s3b0n0 x4006c2s4b0n0 x4006c2s5b0n0 x4006c2s6b0n0 x4006c2s7b0n0 x4006c3s0b0n0 x4006c3s1b0n0 x4006c3s2b0n0 x4006c3s3b0n0 x4006c3s4b0n0 x4006c3s5b0n0 x4006c3s6b0n0 x4006c3s7b0n0 x4006c4s0b0n0 x4006c4s1b0n0 x4006c4s2b0n0 x4006c4s3b0n0 x4006c4s5b0n0 x4006c4s6b0n0 x4006c4s7b0n0 x4006c5s0b0n0 x4006c5s1b0n0 x4006c5s2b0n0 x4006c5s3b0n0 x4006c5s4b0n0 x4006c5s7b0n0 x4006c6s0b0n0 x4006c6s1b0n0 x4006c6s2b0n0 x4006c6s3b0n0 x4006c6s4b0n0 x4006c6s5b0n0 x4006c6s6b0n0 x4006c6s7b0n0 x4006c7s0b0n0 x4006c7s1b0n0 x4006c7s2b0n0 x4006c7s4b0n0 x4006c7s5b0n0 x4006c7s6b0n0 x4006c7s7b0n0 x4007c0s0b0n0 x4007c0s1b0n0 x4007c0s2b0n0 x4007c0s3b0n0 x4007c0s4b0n0 x4007c0s5b0n0 x4007c0s6b0n0 x4007c0s7b0n0 x4007c1s1b0n0 x4007c1s2b0n0 x4007c1s3b0n0 x4007c1s5b0n0 x4007c1s6b0n0 x4007c1s7b0n0 x4007c2s0b0n0 x4007c2s1b0n0 x4007c2s2b0n0 x4007c2s3b0n0 x4007c2s4b0n0 x4007c2s5b0n0 x4007c2s6b0n0 x4007c2s7b0n0 x4007c3s0b0n0

This set spans 2 cabinets, x4006 and x4007, and actually passed the osu_alltoallv microbenchmark (OSU MPI All-to-Allv Personalized Exchange Latency Test v5.6.2, 1024 bytes: avg latency 1548925.82 us).

It took forever, about 30 minutes total walltime, to finish all the iterations.

And today's alltoallv times in LAMMPS:

  • Special Build inbuf Alltoallv Time = 1.6963 secs
  • Special Build inbuf Alltoallv Time = 1.74761 secs
  • Special Build outbuf Alltoallv Time = 0.0117448 secs
  • Special Build inbuf Alltoallv Time = 0.094026 secs
  • Special Build outbuf Alltoallv Time = 0.886424 secs
  • Special Build inbuf Alltoallv Time = 0.852533 secs
  • Special Build outbuf Alltoallv Time = 0.0644331 secs

Image

Even htop showed good behavior.

Whereas previously the times for alltoallv within LAMMPS were:

  • Special Build inbuf Alltoallv Time = 9.96126 secs
  • Special Build inbuf Alltoallv Time = 12.7 secs
  • Special Build outbuf Alltoallv Time = 11.6219 secs
  • Special Build inbuf Alltoallv Time = 13.2233 secs
  • Special Build outbuf Alltoallv Time = 12.5721 secs
  • Special Build inbuf Alltoallv Time = 14.2236 secs
  • Special Build outbuf Alltoallv Time = 12.8218 secs

I then got a second set of 64 nodes to see whether the problem was specific to the first set: x4116c3s6b0n0 x4116c3s7b0n0 x4116c4s1b0n0 x4116c4s2b0n0 x4116c5s0b0n0 x4116c7s4b0n0 x4119c4s6b0n0 x4119c4s7b0n0 x4119c5s0b0n0 x4119c5s1b0n0 x4119c5s2b0n0 x4119c5s3b0n0 x4111c5s5b0n0 x4111c6s5b0n0 x4002c6s2b0n0 x4014c6s2b0n0 x4017c2s3b0n0 x4111c2s0b0n0 x4119c3s6b0n0 x4013c3s5b0n0 x4015c5s0b0n0 x4103c6s2b0n0 x4006c5s5b0n0 x4007c3s4b0n0 x4013c1s4b0n0 x4013c3s1b0n0 x4013c4s1b0n0 x4015c2s5b0n0 x4015c7s6b0n0 x4016c7s3b0n0 x4100c0s7b0n0 x4100c3s6b0n0 x4103c2s5b0n0 x4103c3s7b0n0 x4103c4s4b0n0 x4103c4s5b0n0 x4105c1s4b0n0 x4107c3s7b0n0 x4107c4s2b0n0 x4107c4s3b0n0 x4118c0s7b0n0 x4118c1s4b0n0 x4118c3s1b0n0 x4003c2s4b0n0 x4003c3s1b0n0 x4004c2s1b0n0 x4004c7s0b0n0 x4008c4s3b0n0 x4011c0s7b0n0 x4011c1s6b0n0 x4018c4s0b0n0 x4108c2s3b0n0 x4112c5s7b0n0 x4113c1s4b0n0 x4115c7s7b0n0 x4116c6s1b0n0 x4117c0s7b0n0 x4003c6s2b0n0 x4014c1s5b0n0 x4104c2s1b0n0 x4108c0s6b0n0 x4108c6s7b0n0 x4005c5s3b0n0 x4120c5s3b0n0

This set spans multiple cabinets.

This set exhibits the horrible alltoallv issues:

  • Special Build inbuf Alltoallv Time = 15.1371 secs
  • Special Build inbuf Alltoallv Time = 12.7006 secs
  • Special Build outbuf Alltoallv Time = 13.2 secs
  • Special Build inbuf Alltoallv Time = 13.4127 secs
  • Special Build outbuf Alltoallv Time = 12.7712 secs
  • Special Build inbuf Alltoallv Time = 13.0849 secs
  • Special Build outbuf Alltoallv Time = 13.8368 secs

and osu_alltoallv doesn't finish, and htop again shows the incorrect behavior where the CPU spends time only in user space:

Image

Could this really be a fabric/environment-related issue? I am going to test again with the tier 1 flag and try to get all 64 nodes within 1 cabinet. I have done this testing before, but it had no positive impact.

aditya-nishtala avatar Oct 09 '25 17:10 aditya-nishtala

As per my previous comment, both sets of 64 nodes are from next-eval, and both have nohz_full on all nodes within the set. This means that even though turning off nohz_full helped on Aurora, today's data proves that nohz_full is not the culprit for the alltoallv problem.

aditya-nishtala avatar Oct 09 '25 17:10 aditya-nishtala

If MPIR_CVAR_CH4_PROGRESS_THROTTLE=1 alleviates the issue, then it is the CPU/NIC memory contention issue.

hzhou avatar Oct 10 '25 14:10 hzhou

It is a puzzle to me why different sets of nodes behave differently, and why Sunspot never had the issue.

hzhou avatar Oct 10 '25 14:10 hzhou

That said, I think with effort, it may be possible to tune the progress throttle algorithm to minimize its impact on normal performance and also work around the CPU/NIC contention issue.

hzhou avatar Oct 10 '25 14:10 hzhou

> If MPIR_CVAR_CH4_PROGRESS_THROTTLE=1 alleviates the issue, then it is the CPU/NIC memory contention issue.

I don't believe this is a hardware-related issue. On Sunspot and Borealis this problem doesn't exist; it exists only on Aurora. Sunspot and Aurora have the same software and hardware stack, yet it happens only on Aurora, and it's not even consistent across Aurora either. It also doesn't happen on the prod queue (as of now the prod queue runs 1099 Agama); it happens only on next-eval (as of now next-eval runs 1146). The MPICH version doesn't matter either.

The only difference between Aurora and Sunspot is the scale of the system. This "scale", combined with certain software, is causing an issue.

The throttle workaround is simply a way for apps to continue making progress and hit their performance targets; the real issue must be identified and fixed.

aditya-nishtala avatar Oct 10 '25 15:10 aditya-nishtala

Conclusion from offline discussion: somehow turning off no_hz (https://www.kernel.org/doc/Documentation/timers/NO_HZ.txt) hides the progress contention issue. I think the normal scheduling-clock tick interrupts the process's busy polling on progress, thus reducing the pressure and letting the NIC update the event flag. In HPC we want no_hz for best performance. Thus, moving forward, I think we need the progress throttle on by default. The concern is whether it has a negative performance impact.

MPIR_CVAR_CH4_PROGRESS_THROTTLE=1 inserts a usleep when a process is in a progress loop and has not made progress over MPIR_CVAR_CH4_PROGRESS_THROTTLE_NO_PROGRESS_COUNT (default 4096) rounds. https://github.com/pmodels/mpich/pull/7380
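The mechanism can be illustrated with a small sketch (plain Python, not the actual MPICH implementation; the function and variable names here are illustrative): spin on a progress poll, and once a run of unproductive rounds reaches the threshold, back off with a short sleep so the NIC gets a window to update its event flag.

```python
import time

def progress_wait(poll, no_progress_count=4096, sleep_us=100, max_rounds=1_000_000):
    """Spin on poll() until it reports completion.

    After `no_progress_count` consecutive unproductive rounds, sleep briefly
    (the usleep-style backoff the throttle CVARs enable), then resume polling.
    Returns (completed, number_of_backoff_naps).
    """
    idle = 0
    naps = 0
    for _ in range(max_rounds):
        if poll():                      # progress/completion observed
            return True, naps
        idle += 1
        if idle >= no_progress_count:   # throttle kicks in
            time.sleep(sleep_us / 1e6)
            naps += 1
            idle = 0                    # restart the no-progress counter
    return False, naps

# Example: a completion flag that only flips on the 10,000th poll,
# standing in for a NIC event that takes a while to arrive.
state = {"calls": 0}
def fake_poll():
    state["calls"] += 1
    return state["calls"] >= 10_000

done, naps = progress_wait(fake_poll)
print(done, naps)  # completes after two backoff naps (at rounds 4096 and 8192)
```

With the counter at its default of 4096, a well-behaved operation that completes quickly never triggers the backoff at all, which is why the hope is that enabling the throttle by default has little impact on normal performance.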

hzhou avatar Oct 10 '25 21:10 hzhou