coll: Added recursive multiplying algorithm
Pull Request Description
This PR adds the recursive multiplying allreduce algorithm. It achieves better performance than the existing algorithms and was designed with DOE's new exascale machines (Frontier, Aurora, El Capitan) in mind.
The algorithm is based on recursive doubling. Because recursive doubling only doubles the amount of data exchanged in each round, it can incur unnecessary latency from many rounds of small messages. The recursive multiplying algorithm balances the latency and bandwidth trade-off through the number and size of rounds. It introduces a parameter k that controls the number of communication partners in each round. In round i, every process exchanges data with k - 1 other processes spaced a multiple of k^(i-1) apart, with the specific pairings chosen by dividing the p processes into k^i groups.
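To illustrate the partner selection described above, here is a minimal sketch in C. It is not the code from this PR: the helper names `recmul_partners` and `recmul_rounds` are hypothetical, and the sketch assumes p is an exact power of k, so it ignores the handling a real implementation needs for other process counts.

```c
/* Hypothetical helper (not from this PR): fills partners[0..k-2] with the
 * ranks that `rank` exchanges data with in round `round` (1-based).
 * Assumption: the process count is an exact power of k. */
static int recmul_partners(int rank, int k, int round, int *partners)
{
    int stride = 1;                     /* k^(round-1) */
    for (int i = 1; i < round; i++)
        stride *= k;

    int digit = (rank / stride) % k;    /* this rank's position in its group */
    int base = rank - digit * stride;   /* lowest rank in the group */

    int n = 0;
    for (int j = 0; j < k; j++) {
        if (j != digit)
            partners[n++] = base + j * stride;
    }
    return n;                           /* always k - 1 */
}

/* Hypothetical helper: number of rounds, i.e. log_k(p) when p is a power
 * of k. A larger k means fewer rounds, each with more partners. */
static int recmul_rounds(int p, int k)
{
    int rounds = 0;
    for (int span = 1; span < p; span *= k)
        rounds++;
    return rounds;
}
```

Under these assumptions, 16 processes with k = 4 finish in 2 rounds instead of the 4 rounds needed with k = 2, and with k = 2 the partner computation reduces to the usual recursive doubling pairing.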
The red line in the following graph shows performance vs. recursive doubling on ORNL Frontier (128 nodes, 1ppn). For small message sizes, we see an average speedup of 20%, maxing out at 80%.
For more details see: https://paragon.cs.northwestern.edu/papers/2023-IEEECluster-Collectives-Wilkins.pdf
This is my first MPICH PR in a while, so please let me know if there's anything I should change :smile:
Author Checklist
- [ ] Provide Description Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
- [ ] Commits Follow Good Practice
Commits are self-contained and do not do two things at once.
Commit message is of the form: module: short description
Commit message explains what's in the commit.
- [ ] Passes All Tests Whitespace checker. Warnings test. Additional tests via comments.
- [ ] Contribution Agreement For non-Argonne authors, check contribution agreement. If necessary, request an explicit comment from your company's PR approval manager.
Hi @mjwilkins18 , thanks for contributing!
Some suggestions to help the PR review -
- Could you add more introduction and details to the PR description? It's good that you provided the paper reference. If you could snip some relevant text, figures, performance data to this PR, it will be significantly more convenient for reviewers and future reference.
- Add a commit message describing the changes in addition to the subject line
- Add a test -- it may be as simple as adding a line in test/mpi/maint/coll_cvars.txt
Hi @hzhou, thanks for the tips! I made the changes you requested. Please let me know if there is anything else.
test:mpich/ch3/most test:mpich/ch4/most
@hzhou thanks so much for the thorough review! I think I addressed everything, let me know if there is anything else or anything I missed
test:mpich/ch3/most test:mpich/ch4/most
test:mpich/ch3/most test:mpich/ch4/most
test:mpich/ch3/most test:mpich/ch4/most
test:mpich/ch3/most test:mpich/ch4/most
@mjwilkins18 Approved. Go ahead and rebase onto main, then merge.