Usleep Progress Throttle w/ No Progress Counter
Pull Request Description
This PR is an enhanced version of the progress changes from https://github.com/pmodels/mpich/pull/7368.
https://github.com/pmodels/mpich/pull/7368 introduces MPIR_CVAR_CH4_PROGRESS_THROTTLE, which, when enabled, adds usleep(1) to the progress loop, preventing cache thrashing issues at high PPN on Aurora.
This PR enhances the original design by also adding MPIR_CVAR_CH4_PROGRESS_THROTTLE_NO_PROGRESS_COUNT.
We count how many consecutive progress polls have failed to make progress, and enable the usleep(1) only when that count is greater than MPIR_CVAR_CH4_PROGRESS_THROTTLE_NO_PROGRESS_COUNT. This avoids a slowdown in cases where the usleep is not needed to prevent cache thrashing.
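A minimal sketch of the counting scheme described above, not the actual MPICH code. The struct `throttle_state_t` and the functions `throttle_update` and `throttle_after_poll` are hypothetical names made up for illustration; the two configuration fields merely stand in for the CVARs named in this PR.

```c
#define _DEFAULT_SOURCE         /* expose usleep() under strict -std modes */
#include <limits.h>
#include <stdbool.h>
#include <unistd.h>

/* Hypothetical state; the first two fields stand in for the CVARs. */
typedef struct {
    bool throttle_enabled;      /* MPIR_CVAR_CH4_PROGRESS_THROTTLE */
    int no_progress_threshold;  /* MPIR_CVAR_CH4_PROGRESS_THROTTLE_NO_PROGRESS_COUNT */
    int no_progress_count;      /* consecutive polls that made no progress */
} throttle_state_t;

/* Update the counter after one progress poll.
 * Returns true when the caller should sleep before polling again. */
static bool throttle_update(throttle_state_t *s, bool made_progress)
{
    if (made_progress) {
        /* Any progress resets the streak, so busy ranks never sleep. */
        s->no_progress_count = 0;
        return false;
    }
    if (s->no_progress_count < INT_MAX) {
        s->no_progress_count++;         /* saturate instead of overflowing */
    }
    return s->throttle_enabled &&
           s->no_progress_count > s->no_progress_threshold;
}

/* Call once per poll from the (hypothetical) progress loop. */
static void throttle_after_poll(throttle_state_t *s, bool made_progress)
{
    if (throttle_update(s, made_progress)) {
        usleep(1);                      /* back off only after a long idle streak */
    }
}
```

With a threshold of 4096, the first 4096 idle polls in a row spin as before, and only later idle polls sleep; a single successful poll resets the streak.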
The default value for MPIR_CVAR_CH4_PROGRESS_THROTTLE_NO_PROGRESS_COUNT was selected empirically using 64n/96ppn allreduce tests on Aurora. The data is shown in the figure below. A value of 4096 provides the best balance between preserving good performance for small message sizes and minimizing the hump that appears when the count is first reached.
Author Checklist
- [ ] Provide Description Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
- [ ] Commits Follow Good Practice
Commits are self-contained and do not do two things at once.
Commit message is of the form:
module: short description
Commit message explains what's in the commit.
- [ ] Passes All Tests Whitespace checker. Warnings test. Additional tests via comments.
- [ ] Contribution Agreement For non-Argonne authors, check contribution agreement. If necessary, request an explicit comment from your company's PR approval manager.
Is there a way to find out if MPICH chose this workaround? I am imagining some way of saying to... somebody "hey we ran these six benchmarks and hit the stalling workaround on 5 of them, requiring backoff X thousand times".
We can add a PVAR - https://www.mpi-forum.org/docs/mpi-4.1/mpi41-report/node423.htm
Can that be extracted without making any code changes? CVARs take environment variables... do PVARs set them on the way out?
I guess we can dump selected PVARs at MPI_Finalize, for example, when MPIR_CVAR_DEBUG_SUMMARY is on.
Cray has MPICH_OFI_CXI_COUNTER_REPORT, which sounds like exactly the thing I'm looking for (https://cpe.ext.hpe.com/docs/24.03/mpt/mpich/intro_mpi.html#libfabric-environment-variables-for-hpe-slingshot-nic-slingshot-11). I guess that's not something they upstreamed?
This PR is picked by #7368