
New collectives component: XPMEM-based Hierarchical Collectives (XHC)

gkatev opened this pull request 2 years ago • 17 comments

Hello! This PR proposes the XPMEM-based Hierarchical Collectives (XHC) component.

XHC implements optimized intra-node (only) collectives according to a topology-aware n-level hierarchy, targeting especially (but not limited to) nodes with high core counts and elaborate topologies. Supported topological features include NUMA nodes, CPU sockets, and shared caches (detected via hwloc, through OMPI's internal process-to-process distance mappings). The design is centered around communication using XPMEM, but Bcast also supports CMA and KNEM via opal/smsc. Currently Bcast, Allreduce, and Barrier are supported; Reduce is partially implemented (and disabled by default).

XHC is also combinable with coll/HAN for multi-node runs.

  • Some more information is included in the README inside the component's folder.
  • You might also be interested in our related research paper: A framework for hierarchical single-copy MPI collectives on multicore nodes
    https://doi.org/10.1109/CLUSTER51413.2022.00024
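
For anyone who wants to quickly exercise the component, here is a minimal intra-node example covering the collectives XHC currently implements. The `coll_xhc_priority` name in the comment is assumed from OMPI's usual `coll_<component>_priority` convention; the component's README has the actual XHC-specific knobs (hierarchy definition and other tunables).

```c
/* Minimal sketch: exercises the collectives XHC currently implements
 * (Bcast, Allreduce, Barrier) on a single node. Run with something like
 *   mpirun -n 64 --mca coll_xhc_priority 100 ./a.out
 * (parameter name assumed from OMPI's coll_<component>_priority
 * convention; see the component's README for the exact XHC parameters). */
#include <mpi.h>
#include <stdio.h>

#define N 1024

int main(int argc, char **argv)
{
    int rank;
    double buf[N], sum[N];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int i = 0; i < N; i++)
        buf[i] = (double) rank;

    MPI_Bcast(buf, N, MPI_DOUBLE, 0, MPI_COMM_WORLD);                /* intra-node bcast */
    MPI_Allreduce(buf, sum, N, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD); /* intra-node allreduce */
    MPI_Barrier(MPI_COMM_WORLD);                                     /* intra-node barrier */

    if (rank == 0)
        printf("sum[0] = %f\n", sum[0]);

    MPI_Finalize();
    return 0;
}
```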

Besides the main commit that adds the component, this PR includes two auxiliary commits:

  1. The 2nd commit, an allegedly mostly-trivial one, adds support in coll/HAN for selecting XHC as the component to use for the intra-comm.

  2. The 3rd commit, which is a temporary hack, allows XHC's partially-implemented MPI_Reduce to be used for HAN's Allreduce. XHC's reduce is (for now) implemented as a sub-case of allreduce and requires that the rbuf param be valid for all ranks. I will understand if this commit is not desirable :-) (the reason for its existence being the potential performance gain)

Below are some benchmarks from various nodes and operations:

Allreduce

[graphs: allreduce-ampere, allreduce-dp-cn, allreduce-dp-esb, allreduce-taos, allreduce-vader]

Broadcast

[graphs: bcast-ampere, bcast-taos, bcast-titan, bcast-vader]

Barrier

[graphs: barrier-ampere, barrier-taos, barrier-titan]

HAN+XHC Allreduce

[graphs: dp-cn-allreduce-48x, dp-dam-allreduce-10x, dp-esb-allreduce-10x]

HAN+XHC Broadcast

[graphs: dp-cn-bcast-48x, dp-dam-bcast-10x, dp-esb-bcast-10x]

HAN+XHC Barrier

[graphs: dp-cn-barrier-48x]

gkatev avatar Feb 16 '23 15:02 gkatev

Can one of the admins verify this patch?

ompiteam-bot avatar Feb 16 '23 15:02 ompiteam-bot

ok to test.

bosilca avatar Feb 16 '23 15:02 bosilca

Your barrier graphs are missing the legends or titles needed to understand what they are showing. Different numbers of processes, I assume?

bosilca avatar Feb 16 '23 15:02 bosilca

Ah sorry, something went wrong there; I fixed them and updated the post. It's 3 different nodes (and yes, different proc counts), plus one multi-node run across 48 nodes.

gkatev avatar Feb 16 '23 21:02 gkatev

I might be missing something, but rather than the changes to HAN to support Reduce, why not save the previous Reduce function and call it if the parameters aren't sufficient for your implementation? An example of doing this is the sync collective, which doesn't actually implement any algorithms, but simply calls a barrier after calling the underlying algorithm for the non-synchronizing collectives.
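
As a rough illustration of that pattern, here is a sketch with simplified stand-in types (not the real ompi/mca/coll signatures) and a hypothetical condition for when to defer:

```c
/* Sketch of the "save the previous module's function and fall back" pattern,
 * as used by e.g. coll/sync and coll/han. Types and names are simplified
 * stand-ins, not the actual ompi/mca/coll definitions. */

typedef int (*reduce_fn_t)(const void *sbuf, void *rbuf, int count,
                           void *dtype, void *op, int root,
                           void *comm, void *module);

typedef struct xhc_module {
    /* Cached at module-enable time, before this component installs itself. */
    reduce_fn_t  prev_reduce;
    void        *prev_reduce_module;
} xhc_module_t;

static int xhc_reduce(const void *sbuf, void *rbuf, int count,
                      void *dtype, void *op, int root,
                      void *comm, void *module)
{
    xhc_module_t *m = (xhc_module_t *) module;

    /* Defer to the previously selected module for cases this component
     * cannot handle. A locally checkable condition is an unsupported root;
     * whether rbuf is actually usable on non-root ranks is harder to
     * determine locally, as discussed in the thread. */
    if (0 != root) {
        return m->prev_reduce(sbuf, rbuf, count, dtype, op, root,
                              comm, m->prev_reduce_module);
    }

    /* ... optimized hierarchical single-copy reduction would go here ... */
    return 0; /* MPI_SUCCESS */
}
```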

bwbarrett avatar Feb 16 '23 22:02 bwbarrett

I could do that with the root parameter, but from what I understand not with the recvbuf one. Even if it is non-NULL, I can't assume that it points to valid and appropriately sized memory. And I'd need to know that this is the case for all ranks (or at least the leaders if I recall correctly). Unless the standard says it has to be either NULL or fully valid (??).

In the future, I was thinking of implementing the intermediate buffers necessary on the leaders for the hierarchical reduction using a memory pool (not sure if opal/mpool does what I have in mind or not). This might then be further extended to have HAN's Allreduce register the rbuf to the pool as a "single-use" memory area, which would then be picked up by xhc's reduce when it requests a temporary buffer to use.

gkatev avatar Feb 17 '23 06:02 gkatev

I could do that with the root parameter, but from what I understand not with the recvbuf one. Even if it is non-NULL, I can't assume that it points to valid and appropriately sized memory. And I'd need to know that this is the case for all ranks (or at least the leaders if I recall correctly). Unless the standard says it has to be either NULL or fully valid (??).

I think you're right; that's a bummer. I'm not a huge fan of the coupling between xhc and han, especially to the level in the last patch, but maybe it is unavoidable.

In the future, I was thinking of implementing the intermediate buffers necessary on the leaders for the hierarchical reduction using a memory pool (not sure if opal/mpool does what I have in mind or not). This might then be further extended to have HAN's Allreduce register the rbuf to the pool as a "single-use" memory area, which would then be picked up by xhc's reduce when it requests a temporary buffer to use.

Other than the HAN/xhc coupling, that makes sense. The rbuf definition in the MPI standard is rather unfortunate, but such is life.

bwbarrett avatar Feb 17 '23 16:02 bwbarrett

Thanks for the review and the initial comments. I addressed most of them with code changes. ~I added them as a new commit, as I imagined it would make keeping track of the improvements easier (?). (To be squashed before merging.)~

gkatev avatar Feb 22 '23 09:02 gkatev

bot:ibm:retest

jjhursey avatar Feb 22 '23 15:02 jjhursey

Finally done with the review and :+1:. However, I would like to revisit the discussion above about the rbuf on reduce for non-root. I understand the interest, but I do not like the implementation. I understand the issue arises from the MPI standard itself, but I would like to propose an OMPI-specific and sensible solution: in our upper-level implementation we switch the rbuf coming from the user to NULL on all non-root processes, to indicate that we don't know whether its use is safe. Thus, if a component receives a non-NULL buffer, it indicates that some other part of OMPI has allocated the memory and it is therefore safe to use. This will have no impact on existing collective components and will allow this component to remove the hack with the root. Thoughts?
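
A minimal sketch of how that convention could look, with hypothetical names (not actual OMPI code):

```c
/* Illustration of the proposed convention; function and variable names
 * are hypothetical, not actual OMPI code. */
#include <stddef.h>

/* Upper layer, before calling into the selected reduce implementation:
 * discard the user's rbuf on non-root ranks, since the MPI standard
 * gives it no defined meaning there. */
static void *sanitized_rbuf(void *user_rbuf, int rank, int root)
{
    return (rank == root) ? user_rbuf : NULL;
}

/* Component side: decide what to use as the receive/scratch buffer. */
static void *pick_rbuf(void *rbuf, int rank, int root)
{
    if (rank == root) {
        return rbuf;   /* the user's receive buffer */
    }
    if (rbuf != NULL) {
        /* Non-NULL on a non-root rank means some other part of OMPI
         * (e.g. HAN's allreduce) allocated it, so it is safe to use. */
        return rbuf;
    }
    return NULL;       /* component must allocate its own buffer */
}
```

Since the convention only nulls out a buffer that components could not rely on anyway, existing components keep working unchanged.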

bosilca avatar Mar 31 '23 19:03 bosilca

@bosilca I really like that solution.

bwbarrett avatar Mar 31 '23 19:03 bwbarrett

I like it! I'll implement/test it on the XHC side and update the PR

gkatev avatar Apr 01 '23 08:04 gkatev

The only caveat is that the root, for which sbuf and rbuf are both non-NULL, won't know whether rbuf is present on the other ranks. It looks like communication between the root and at least one other rank is necessary for the root to determine whether to use the fallback reduce or not.

It's certainly possible and better than the other hack, so I'll think it over.

Part of the problem is the incomplete implementation of reduce. In the future, the decision shouldn't be whether to fall back to another component (based on the presence of rbuf), but rather whether to allocate a temporary buffer for non-roots. The latter does not require any rbuf-related action on the root's side.

gkatev avatar Apr 01 '23 10:04 gkatev

I see, you need more than local knowledge about the buffers. My patch will definitely not provide that; it just gives us all (aka implementers of collective algorithms) better knowledge of what we can use locally.

Let me think a little about the buffer management; the cleaner approach would be a well-defined internal API for managing temporary buffers. If you don't mind, send me a message (email or Slack) about the requirements you would need from such an API.

bosilca avatar Apr 02 '23 15:04 bosilca

I got rid of the hacky approach for the HAN allreduce. XHC's reduce now keeps a sticky internal buffer that it manually allocates and grows in size when necessary. If rbuf is NULL (and thanks to #11552, a scenario where rbuf is non-NULL but is not valid is not possible), the internal buffer is used instead. During HAN's allreduce, in XHC's reduce, the rbuf will be present for all ranks, and in this case an internal rbuf won't need to be allocated. The only case in which XHC's reduce falls back to another component now is when root != 0, but this only requires local knowledge to determine.
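
For reference, a minimal sketch of such a sticky, grow-only scratch buffer (illustrative only; not the actual XHC code or its field names):

```c
/* Sketch of a "sticky" grow-only scratch buffer of the kind described
 * above (illustrative; names are made up, not the actual XHC code). */
#include <stdlib.h>

typedef struct {
    void   *buf;
    size_t  size;
} sticky_buf_t;

/* Return a buffer of at least `needed` bytes, reusing (and only ever
 * growing) the allocation across calls. */
static void *sticky_buf_get(sticky_buf_t *sb, size_t needed)
{
    if (sb->size < needed) {
        void *p = realloc(sb->buf, needed);
        if (NULL == p) {
            return NULL;
        }
        sb->buf  = p;
        sb->size = needed;
    }
    return sb->buf;
}

static void sticky_buf_fini(sticky_buf_t *sb)
{
    free(sb->buf);
    sb->buf  = NULL;
    sb->size = 0;
}
```

Growing but never shrinking means that, in steady state (e.g. repeated reductions of the same size), no allocation happens at all, at the cost of holding on to the largest size seen so far.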

This PR should ideally be merged after #11552. See also #11650.

I also applied a bunch of misc fixes and a small new feature for the barrier's root rank (it should all be in github's comparison).

gkatev avatar May 05 '23 11:05 gkatev

Hello @bosilca and others, is now a good time to look into completing the integration of this PR? From what I understand, the only prerequisites are #11552 and #11650, correct? Can I assist in any way, e.g. with testing or code? I could look into putting together a PR to address the issue identified in #11650 (?). I'll also bring this PR up to speed with the latest main, and test to make sure everything still works.

gkatev avatar Jan 29 '24 17:01 gkatev

I performed some tests and measurements with this component; I think the performance improvements are very significant, and we should try to merge this in the near future, before we branch for the next release. (Short-message allreduce was the only area in which the xhc component was slower than tuned, but I suspect that this could be tuned as well.) I can also provide the raw data instead of graphs if people prefer.

Performance Graphs

[graphs: xhc_osu_bcast_64, xhc_osu_reduce_64, xhc_osu_allreduce_64]

edgargabriel avatar Feb 09 '24 20:02 edgargabriel

With #11552 and #11650 completed, I believe we are now ready to merge this? Is there something else pending?

Thanks for the benchmarks, Edgar, I'll take a look regarding the short-message allreduce. It's also possible that it has already been improved/fixed in upcoming versions; I might ask you at some point if you can repeat the test!

gkatev avatar Apr 03 '24 08:04 gkatev

There were some mpi4py CI failures, with the tests timing out (327, 328). They actually went away after I re-ran the CI twice. Kinda mysterious, since none of the changes introduced here are expected to be exercised at all unless specifically activated. I'll try to see if I can reproduce the failures on my machine. In the meantime, if there are any insights on the mpi4py CI and whether the failures are related to this PR or not...

gkatev avatar Apr 03 '24 19:04 gkatev

Wrt the mpi4py failures, they look to be related to intermittent problems we are seeing with intercommunicators. They are highly unlikely to be triggered by the work in this PR.

hppritcha avatar Apr 03 '24 19:04 hppritcha

@gkatev We should have brought this to your attention earlier.

During our developer meeting last week, we had a constructive discussion about this PR. The consensus is that the change is rather safe, as it does not alter default collective behaviors. However, from a code-maintenance perspective, the community is severely under-staffed (we are all volunteers). We want to understand whether your sponsor would be committed to maintaining XHC in the short to long term. This includes fixing bugs as well as reviewing PRs from other contributors.

wenduwan avatar May 01 '24 18:05 wenduwan

@gkatev I sent you an email earlier this week about the topics mentioned above but have not received an answer yet.

bosilca avatar May 01 '24 18:05 bosilca

@bosilca Sorry, I didn't know that.

wenduwan avatar May 01 '24 19:05 wenduwan

No problem, my emails either bounced back (the one from the XHC paper) or acted as a black hole (his GitHub email). Let's hope a direct poke at @gkatev works better ;)

bosilca avatar May 01 '24 19:05 bosilca

Hi, sorry for the delay in responding, I just sent a reply. (short answer: yes, we are able to support this work as necessary!)

(PS the email from the paper will stop bouncing quite soon...)

gkatev avatar May 01 '24 19:05 gkatev

Please revise this PR to account for the merge of the init/fini PR (https://github.com/open-mpi/ompi/pull/12429)

bosilca avatar May 01 '24 19:05 bosilca

ready

gkatev avatar May 01 '24 21:05 gkatev