[WIP] A new shared memory collectives component
This PR adds coll/smdirect, a clone of coll/sm that relies on cross-process memory mapping as provided by XPMEM. In contrast to coll/sm, data is not copied into an intermediate buffer; instead, buffers are registered with XPMEM and the access keys are exchanged. Processes can then copy data directly from the source buffer to the target buffer. A similar synchronization mechanism using atomic flags is used to wait for data availability. So far, we have implemented broadcast, barrier, reduce, and allreduce (as a combination of reduce and bcast). Eventually, this component should replace the (apparently unmaintained) coll/sm component.
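The direct-copy idea with flag-based synchronization can be sketched roughly as follows. This is not the code in this PR: threads stand in for processes, a plain pointer stands in for an attached XPMEM mapping, and all names (`bcast_ctl_t`, `direct_bcast_demo`, `MAX_READERS`) are hypothetical. It only shows the ordering: the root publishes its buffer, sets a flag with release semantics, and each reader waits on the flag with acquire semantics before copying straight from the source.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

#define MAX_READERS 8 /* hypothetical limit for this sketch */

/* Hypothetical control block; a real component would place it in shared memory. */
typedef struct {
    const void *src;  /* stands in for an attached mapping of the root's buffer */
    size_t len;       /* number of bytes to copy */
    atomic_int ready; /* set by the root once src and len are published */
    atomic_int done;  /* number of readers that finished copying */
} bcast_ctl_t;

typedef struct {
    bcast_ctl_t *ctl;
    void *dst;
} reader_arg_t;

static void *reader_main(void *p)
{
    reader_arg_t *arg = p;
    bcast_ctl_t *ctl = arg->ctl;

    /* Wait until the root has published its buffer. */
    while (!atomic_load_explicit(&ctl->ready, memory_order_acquire)) {
        sched_yield();
    }
    /* Copy directly from the source buffer, without an intermediate buffer. */
    if (ctl->len > 0) {
        memcpy(arg->dst, ctl->src, ctl->len);
    }
    /* Tell the root that this reader no longer needs the source buffer. */
    atomic_fetch_add_explicit(&ctl->done, 1, memory_order_release);
    return NULL;
}

/* Broadcast len bytes from src into each dsts[i]; returns 0 on success, -1 otherwise. */
int direct_bcast_demo(const void *src, size_t len, void **dsts, int nreaders)
{
    assert(nreaders > 0 && nreaders <= MAX_READERS);

    bcast_ctl_t ctl = { .src = NULL, .len = 0 };
    atomic_init(&ctl.ready, 0);
    atomic_init(&ctl.done, 0);

    pthread_t tids[MAX_READERS];
    reader_arg_t args[MAX_READERS];
    int started = 0, rc = 0;

    for (; started < nreaders; started++) {
        args[started].ctl = &ctl;
        args[started].dst = dsts[started];
        if (pthread_create(&tids[started], NULL, reader_main, &args[started]) != 0) {
            rc = -1; /* publish an empty message so started readers can exit */
            break;
        }
    }

    if (rc == 0) {
        ctl.src = src;
        ctl.len = len;
    }
    /* The release store makes src and len visible to readers that observe ready == 1. */
    atomic_store_explicit(&ctl.ready, 1, memory_order_release);

    /* The source buffer may only be reused once every reader has copied from it. */
    while (atomic_load_explicit(&ctl.done, memory_order_acquire) < started) {
        sched_yield();
    }
    for (int i = 0; i < started; i++) {
        pthread_join(tids[i], NULL);
    }
    return rc;
}
```

A caller passes the root's buffer and one destination pointer per reader. The `done` counter illustrates the cost of skipping the intermediate copy: the root has to wait for all readers before it can leave the operation and touch its buffer again.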
Below are some performance measurements on Hawk (a 2x64-core AMD EPYC system installed at HLRS), with min/max/avg times taken from the OSU benchmarks. The new component shows good bandwidth for larger messages. For small messages, however, both coll/tuned and coll/sm show significantly lower minimum times (and thus lower averages) due to buffering (an intermediate buffer in coll/sm, eager messages in coll/tuned). coll/smdirect will address this in future work by buffering small messages in a pre-registered buffer. The maximum times (the longest time any process spends in the collective) are the interesting part: coll/smdirect is competitive for reduce operations and provides significant improvements for large broadcasts. The current implementation of coll/allreduce

In the OSU benchmarks, the barrier implementation in coll/sm is faster (2.5 µs) than the one in coll/smdirect (5.3 µs) because coll/sm is set up to partially overlap the execution of two consecutive barriers. Both are faster than coll/tuned (6.3 µs), though.
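To illustrate how consecutive barriers can share state without a reset phase in between, here is a minimal sketch of a centralized sense-reversing barrier. This is a textbook technique and an assumption on my part, not necessarily how coll/sm achieves its overlap. Threads again stand in for processes, and all names (`sr_barrier_t`, `sr_barrier_wait`, `sr_barrier_demo`) are hypothetical.

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

#define MAX_WORKERS 16 /* hypothetical limit for this sketch */

typedef struct {
    atomic_int count;  /* arrivals still outstanding in the current barrier */
    atomic_bool sense; /* flips once per completed barrier */
    int nprocs;
} sr_barrier_t;

void sr_barrier_init(sr_barrier_t *b, int nprocs)
{
    atomic_init(&b->count, nprocs);
    atomic_init(&b->sense, false);
    b->nprocs = nprocs;
}

/* local_sense is private to each participant and must start out as false. */
void sr_barrier_wait(sr_barrier_t *b, bool *local_sense)
{
    bool my_sense = !*local_sense;
    *local_sense = my_sense;

    if (atomic_fetch_sub_explicit(&b->count, 1, memory_order_acq_rel) == 1) {
        /* Last arrival: re-arm the counter first, then release the waiters. */
        atomic_store_explicit(&b->count, b->nprocs, memory_order_relaxed);
        atomic_store_explicit(&b->sense, my_sense, memory_order_release);
    } else {
        while (atomic_load_explicit(&b->sense, memory_order_acquire) != my_sense) {
            sched_yield();
        }
    }
}

typedef struct {
    sr_barrier_t *barrier;
    atomic_int *arrivals;   /* total number of barrier entries so far */
    atomic_int *violations; /* times a worker left a barrier too early */
    int rounds;
} worker_arg_t;

static void *worker_main(void *p)
{
    worker_arg_t *a = p;
    bool local_sense = false;

    for (int r = 0; r < a->rounds; r++) {
        atomic_fetch_add(a->arrivals, 1);
        sr_barrier_wait(a->barrier, &local_sense);
        /* After round r, every worker must have entered at least r + 1 times. */
        if (atomic_load(a->arrivals) < (r + 1) * a->barrier->nprocs) {
            atomic_fetch_add(a->violations, 1);
        }
    }
    return NULL;
}

/* Runs back-to-back barriers and returns the number of ordering violations seen. */
int sr_barrier_demo(int nprocs, int rounds)
{
    if (nprocs < 1 || nprocs > MAX_WORKERS) {
        return -1;
    }
    sr_barrier_t barrier;
    atomic_int arrivals, violations;
    sr_barrier_init(&barrier, nprocs);
    atomic_init(&arrivals, 0);
    atomic_init(&violations, 0);

    pthread_t tids[MAX_WORKERS];
    worker_arg_t arg = { &barrier, &arrivals, &violations, rounds };
    for (int i = 0; i < nprocs; i++) {
        /* A missing participant would deadlock the barrier, so give up entirely. */
        if (pthread_create(&tids[i], NULL, worker_main, &arg) != 0) {
            abort();
        }
    }
    for (int i = 0; i < nprocs; i++) {
        pthread_join(tids[i], NULL);
    }
    return atomic_load(&violations);
}
```

Because the flag alternates instead of being cleared, a fast participant can already decrement the counter for the next barrier while slower ones are still leaving the previous one.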
This PR is work in progress. Things to do:
- [ ] Add NUMA awareness.
- [ ] Add more collective operations and refine the allreduce implementation.
- [ ] Add small-data buffering, similar to coll/sm, to improve latency.
- [ ] Investigate the use of smsc components that do not provide mapping capabilities (cma, knem).
- [ ] Investigate interaction with coll/han (for the node-local portion).
- [ ] Cleanup code and commits.
This PR includes the fixes to smsc/xpmem in #10127 and needs to be rebased once that is merged.