coll: Circulant graph queued variable ring bcast algorithm
This algorithm achieves better performance than existing bcast algorithms for both small and large message sizes.
The algorithm is based on the circulant graph abstraction and Jesper Larsson Traff's recent paper: https://dl.acm.org/doi/full/10.1145/3735139. It creates communication schedules around various rings in the circulant graph, then repeats the schedule to pipeline message chunks. We introduce a FIFO queue for overlapping sends and receives across communication rounds, which particularly benefits small messages.
In the graph below, we show the algorithm's performance for a fixed chunk size (256k) and queue length (24) for various scales on ANL Aurora (N, PPN). The baseline for this graph is the best-performing algorithm currently in MPICH, so all speedups represent improvements over all algorithms currently in the library. We note that the performance drops around our selected chunk size (256k). By tuning the chunk size near this message size, it is possible to achieve a speedup across all message sizes for all scales.
Pull Request Description
Author Checklist
- [ ] Provide Description Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
- [ ] Commits Follow Good Practice
Commits are self-contained and do not do two things at once.
Commit message is of the form:
module: short descriptionCommit message explains what's in the commit. - [ ] Passes All Tests Whitespace checker. Warnings test. Additional tests via comments.
- [ ] Contribution Agreement For non-Argonne authors, check contribution agreement. If necessary, request an explicit comment from your companies PR approval manager.