New collectives component: XPMEM-based Hierarchical Collectives (XHC)
Hello, this PR proposes the XPMEM-based Hierarchical Collectives (XHC) component.
XHC implements optimized intra-node (only) collectives according to a topology-aware n-level hierarchy, aimed especially (but not exclusively) at nodes with high core counts and elaborate topologies. Supported topological features include NUMA nodes, CPU sockets, and shared caches (detected via hwloc, through ompi's internal process-to-process distance mappings). The design is centered around communication using XPMEM, but Bcast also supports CMA and KNEM via opal/smsc. Currently Bcast, Allreduce and Barrier are supported; Reduce is partially implemented (and disabled by default).
XHC is also combinable with coll/HAN for multi-node runs.
- Some more information is included in the README inside the component's folder.
- You might also be interested in our related research paper:
A framework for hierarchical single-copy MPI collectives on multicore nodes
https://doi.org/10.1109/CLUSTER51413.2022.00024
Besides the main commit that adds the component, this PR includes two auxiliary commits:
- The 2nd commit, a mostly trivial one, adds support in coll/HAN for selecting XHC as the component to use for the intra-comm.
- The 3rd commit, which is a temporary hack, allows XHC's partially-implemented MPI_Reduce to be used for HAN's Allreduce. XHC's reduce is (for now) implemented as a sub-case of allreduce and requires that the rbuf param be valid for all ranks. I will understand if this commit is not desirable :-) (the reason for its existence being the potential for increased performance).
Below are some benchmarks from various nodes and operations:
(Benchmark graphs attached: Allreduce, Broadcast, Barrier, HAN+XHC Allreduce, HAN+XHC Broadcast, HAN+XHC Barrier.)
Can one of the admins verify this patch?
ok to test.
Your barrier graphs are missing legends or titles, making it hard to understand what they are showing. Different numbers of processes, I assume?
Ah sorry, something went wrong there; I fixed them and updated the post. It's 3 different nodes (and yes, different proc counts), plus one multi-node run across 48 nodes.
I might be missing something, but rather than the changes to HAN to support Reduce, why not save the previous Reduce function and call it if the parameters aren't sufficient for your implementation? An example of doing this is the sync collective, which doesn't actually implement any algorithms, but simply calls a barrier after calling the underlying algorithm for the non-synchronizing collectives.
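For reference, a minimal sketch of that fallback pattern (placeholder names and simplified signatures, not the actual OMPI coll framework types):

```c
/* Sketch: save the previously selected reduce function at module-enable
 * time, and defer to it whenever the parameters aren't ones this
 * component can handle. All names here are illustrative placeholders. */

#include <stddef.h>

typedef int (*reduce_fn_t)(const void *sbuf, void *rbuf, int count,
                           int root, void *comm, void *module);

typedef struct example_module {
    reduce_fn_t previous_reduce;   /* saved when the module is enabled */
    void       *previous_module;
} example_module_t;

static int example_can_handle(void *rbuf, int root) {
    /* Placeholder applicability check, e.g. "rbuf is usable" and
     * "the root is the one we support". */
    return (rbuf != NULL) && (root == 0);
}

int example_reduce(const void *sbuf, void *rbuf, int count,
                   int root, void *comm, void *module) {
    example_module_t *m = (example_module_t *) module;

    if (!example_can_handle(rbuf, root)) {
        /* Parameters we can't (yet) support: defer to whichever
         * component was selected before us. */
        return m->previous_reduce(sbuf, rbuf, count, root,
                                  comm, m->previous_module);
    }

    /* ... the component's own reduction would go here ... */
    (void) sbuf; (void) count; (void) comm;
    return 0;
}
```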
I could do that with the root parameter, but from what I understand not with the recvbuf one. Even if it is non-NULL, I can't assume that it points to valid and appropriately sized memory. And I'd need to know that this is the case for all ranks (or at least the leaders if I recall correctly). Unless the standard says it has to be either NULL or fully valid (??).
In the future, I was thinking of implementing the intermediate buffers necessary on the leaders for the hierarchical reduction using a memory pool (not sure if opal/mpool does what I have in mind or not). This might then be further extended to have HAN's Allreduce register the rbuf to the pool as a "single-use" memory area, which would then be picked up by xhc's reduce when it requests a temporary buffer to use.
> I could do that with the `root` parameter, but from what I understand not with the `recvbuf` one. Even if it is non-NULL, I can't assume that it points to valid and appropriately sized memory. And I'd need to know that this is the case for all ranks (or at least the leaders if I recall correctly). Unless the standard says it has to be either NULL or fully valid (??).
I think you're right; that's a bummer. I'm not a huge fan of the coupling between xhc and han, especially to the level in the last patch, but maybe it is unavoidable.
> In the future, I was thinking of implementing the intermediate buffers necessary on the leaders for the hierarchical reduction using a memory pool (not sure if opal/mpool does what I have in mind or not). This might then be further extended to have HAN's Allreduce register the rbuf to the pool as a "single-use" memory area, which would then be picked up by xhc's reduce when it requests a temporary buffer to use.
Other than the HAN/xhc coupling, that makes sense. The rbuf definition in the MPI standard is rather unfortunate, but such is life.
Thanks for the review and the initial comments. I addressed most of them with code changes. ~I added them as a new commit, as I imagined it would make keeping track of the improvements easier (?). (To be squashed before merging.)~
bot:ibm:retest
Finally done with the review and :+1:. However, I would like to revisit the discussion above about the rbuf on reduce for non-root ranks. I understand the interest, but I do not like the implementation. I understand the issue arises from the MPI standard itself, but I would like to propose an OMPI-specific and sensible solution: in our upper-level implementation we switch the rbuf coming from the user to NULL on all non-root processes, to indicate that we don't know if its use is safe. Thus, if a component receives a non-NULL buffer, that indicates that some other part of OMPI has allocated the memory and it is therefore safe to use. This will have no impact on existing collective components and will allow this component to remove the hack with the root. Thoughts?
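Roughly, the proposed convention would look like this (simplified stand-in types and names, not the actual OMPI code paths):

```c
/* Sketch of the proposed convention: the upper-level reduce entry point
 * nulls out the user-supplied rbuf on non-root ranks, so any non-NULL
 * rbuf a component sees is memory that OMPI itself provided and is
 * therefore safe to use. Types and names are simplified stand-ins. */

#include <stddef.h>

typedef int (*reduce_fn_t)(const void *sbuf, void *rbuf, int count,
                           int root, void *comm, void *module);

int upper_level_reduce(const void *sbuf, void *rbuf, int count, int root,
                       int my_rank, void *comm, void *module,
                       reduce_fn_t component_reduce) {
    if (my_rank != root) {
        /* The MPI standard makes no guarantees about rbuf on non-root
         * ranks, so don't let components see a value they can't trust. */
        rbuf = NULL;
    }

    /* An internal caller (e.g. an allreduce built on top of reduce) that
     * does provide a valid rbuf on every rank passes it through
     * unchanged, signalling "safe to use". */
    return component_reduce(sbuf, rbuf, count, root, comm, module);
}
```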
@bosilca I really like that solution.
I like it! I'll implement/test it on the XHC side and update the PR
The only caveat is that the root, for whom sbuf and rbuf are both non-NULL, won't know whether rbuf is present for other ranks or not. Looks like communication between root and at least one other rank is necessary in order for the root to determine whether to use the fallback reduce or not.
It's certainly possible and better than the other hack, so I'll think it over.
Part of the problem is the incomplete implementation of reduce. In the future the decision shouldn't be whether to fall back to another component or not (based on the presence of rbuf), but rather whether to allocate a temporary buffer for non-roots. The latter does not require any rbuf-related action on the root's side.
I see, you need more than local knowledge about the buffers. My patch will definitely not provide that; it just gives us all (aka implementers of collective algorithms) better knowledge of what we can use locally.
Let me think a little about the buffer management; the cleaner approach would be a well-defined internal API for the management of temporary buffers. If you don't mind, send me a message (email or slack) about the requirements you would need from such an API.
I got rid of the hacky approach for the HAN allreduce. XHC's reduce now keeps a sticky internal buffer that it manually allocates and grows in size when necessary. If rbuf is NULL (and thanks to #11552, a scenario where rbuf is non-NULL but is not valid is not possible), the internal buffer is used instead. During HAN's allreduce, in XHC's reduce, the rbuf will be present for all ranks, and in this case an internal rbuf won't need to be allocated. The only case in which XHC's reduce falls back to another component now is when root != 0, but this only requires local knowledge to determine.
This PR should ideally be merged after #11552. See also #11650.
I also applied a bunch of misc fixes and a small new feature for the barrier's root rank (it should all be in github's comparison).
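For illustration, a minimal sketch of the grow-on-demand internal buffer described above (field and function names are illustrative, not the actual XHC code):

```c
/* Sketch: a per-module scratch buffer that is allocated lazily and only
 * ever grows, used whenever rbuf is NULL. Names are placeholders. */

#include <stdlib.h>

typedef struct example_module {
    void   *rbuf_scratch;       /* lazily allocated, reused across calls */
    size_t  rbuf_scratch_size;
} example_module_t;

static void *example_get_rbuf(example_module_t *m, void *rbuf,
                              size_t bytes) {
    if (rbuf != NULL) {
        /* With the upper-level NULLing convention (#11552), a non-NULL
         * rbuf is known to be valid; no internal buffer is needed. */
        return rbuf;
    }

    if (m->rbuf_scratch_size < bytes) {
        void *p = realloc(m->rbuf_scratch, bytes);
        if (p == NULL) {
            return NULL;   /* caller reports the error / falls back */
        }
        m->rbuf_scratch = p;
        m->rbuf_scratch_size = bytes;
    }
    return m->rbuf_scratch;
}
```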
Hello @bosilca, and others, is now a good time to look into completing the integration of this PR? From what I understand the only prerequisites are #11552 and #11650, correct? Can I assist in any way, e.g. with testing or code? I could look into putting together a PR to address the issue identified in #11650 (?). I'll also bring this PR up to speed with the latest main, and test to make sure all still works.
I performed some tests and measurements with this component; I think the performance improvements are very significant and we should try to merge this in the near future, before we branch for the next release. (Short-message allreduce was the only area in which the xhc component was slower than tuned, but I would suspect that this could be tuned as well.) I can also provide the raw data instead of graphs if people prefer.
Performance Graphs
With #11552 and #11650 completed, I believe we are now ready to merge this? Is there something else pending?
Thanks for the benchmarks Edgar, I'll take a look regarding the short message allreduce. It's also possible that it has already been improved/fixed in upcoming versions, might ask you at some point if you can repeat the test!
There were some mpi4py CI failures, with the tests timing out (327, 328). They actually went away after I re-ran the CI twice. Kind of mysterious, since the changes introduced here are not expected to be exercised at all unless specifically activated. I'll try to see if I can reproduce the failures on my machine. In the meantime, if there are any insights on the mpi4py CI and whether the failures are related to this PR or not...
Wrt the mpi4py failures, they look to be related to problems we are seeing with intercommunicators (intermittent). They are highly unlikely to be triggered by the work in this PR.
@gkatev We should have brought this to your attention earlier.
During our developer meeting last week, we had a constructive discussion about this PR. The consensus is that the change is rather safe, as it does not alter default collective behaviors. However, from a code maintenance perspective, the community is severely under-staffed (we are all volunteers). We want to understand whether your sponsor would be committed to maintaining XHC in the short to long term. This includes fixing bugs, as well as reviewing PRs from other contributors.
@gkatev I sent you an email earlier this week about the topics mentioned above but have not received an answer yet.
@bosilca Sorry I didn't know that.
No problem; they either bounced back (the email from the XHC paper) or acted as a black hole (his github email). Let's hope a direct poke @gkatev works better ;)
Hi, sorry for the delay in responding, I just sent a reply. (short answer: yes, we are able to support this work as necessary!)
(PS the email from the paper will stop bouncing quite soon...)
Please revise this PR to account for the merge of the init/fini PR (https://github.com/open-mpi/ompi/pull/12429)
ready