Add support for Unified Collective Communication (UCC)
We saw that you’ve started with a proof of concept of how to use NCCL as a backend for MPI collectives. However, wouldn't a UCC (Unified Collective Communication) integration be more versatile, since it would also cover NCCL alongside other backends?
To estimate the effort for such an integration, we at ParTec have already started a prototypical UCC integration in the style of the existing HCOLL wrappers, and we wonder if it might make sense to collaborate on this and/or to discuss the pros and cons of the different approaches.
As a starting point, we could use this Issue here. What do you think?
@mjwilkins18 FYI ParaStation MPI folks are interested in your CCL framework design.
Hi @carsten-clauss, thanks for your interest! The proof of concept you linked is the first step in a broader effort to provide a framework for MPICH to integrate with standalone collective libraries. We plan to support UCC, NCCL, RCCL, and OneCCL initially and are open to additional suggestions. I am in the middle of revising the proof of concept to properly integrate CCL algorithms with the existing collective selection structure. Let me get back to you early next week once I have that pushed to discuss in more detail.
In the interim, I would be happy to see/hear more about what you are working on. Based on your description, it sounds like you are working at the device layer in MPICH. You mentioned that you are following the HCOLL wrappers, so does that mean you are tying your implementation specifically into the UCX device path? I believe there is benefit to a broader approach, since even vanilla NCCL is used on other devices these days (e.g., the aws-ofi-nccl plugin).
Hi @mjwilkins18, Thank you very much for your reply and for your interest! As already said, we at ParTec would be very happy to discuss this subject further and perhaps to collaborate on it.
In fact, you are right that our prototype is at the device level (mpid/common). Note, however, that with our ADI3 device PSP (mpid/psp) of ParaStation MPI, we support not only UCX for point-to-point communication but, via our low-level communication library pscom, also other communication paths (e.g., Portals/BXI).
For accelerated collectives, we currently only rely on the existing HCOLL wrappers in mpid/common. However, for the future we will definitely need NCCL support, too, which we hope to get, for example, through a UCC integration.
Do I understand your approach correctly, that you want to establish something similar to a "second device interface" dedicated to collectives and the respective accelerators that is independent of the ADI3 device in use? If this continues to clearly separate the hardware-agnostic part of MPICH from the actual communication devices and drivers, then this would indeed be a very useful extension to MPICH from our point of view. :+1:
Looking forward to getting some more insights into your development plans here.
@carsten-clauss #7298 is now updated with the CCL framework design I have in mind. You can see how the prototype can be extended for additional CCLs such as UCC (and collectives beyond Allreduce).
Do I understand your approach correctly, that you want to establish something similar to a "second device interface" dedicated to collectives and the respective accelerators that is independent of the ADI3 device in use?
I think you are correct. In my prototype, the CCL is invoked as an MPIR collective algorithm, avoiding the device layer entirely.
If this continues to clearly separate the hardware-agnostic part of MPICH from the actual communication devices and drivers
This is my goal! Let me know what you think and if you have any questions.
Hello everybody, please excuse the silence here for a while, but now that #7298 has been merged, I think it's time to resume the discussion about how to tackle a full-fledged UCC integration.
As already mentioned above, we at ParTec had already started evaluating a prototypical UCC integration at the device level (i.e., in mpid/common, similar to hcoll). It is by now mature enough, at least from an architectural point of view, that we would like to share it with you. For this, I have made the corresponding development branch of ParaStation MPI available via my private GitHub fork:
https://github.com/carsten-clauss/psmpi/tree/cc/github/mr-draft-ucc-support
The main components of this UCC integration are the last 3-4 commits, which roughly do the following:
- The first adds wrappers for the UCC API in mpid/common/ucc.
- The second adds the calls of these bindings to ParTec's PSP device.
- And the third does the same for the UCX netmod of the CH4 device (just as a reference).
It would be great if you could take a look and give us some feedback on this approach. And perhaps it would make sense afterwards, for example in a brief online meeting, to discuss which path (or paths) we could take together to achieve a stable UCC integration for both ParaStation MPI and MPICH.
@mjwilkins18 What do you think?
Hi @carsten-clauss, no worries, thanks for following up! I will take a look at your code and provide feedback. I would be happy to set up an online meeting to discuss in more detail.
Hi @carsten-clauss, thanks again for this excellent prototype. I had a chance to take a brief look through the code. I plan to spend more time on it, but I have a few high-level questions:
- I saw you rely on MPI_COMM_WORLD for setup, which I agree with for a first implementation. In the future, how important do you think it is for your customers/apps to support UCC within the Sessions model?
- What is the benefit of implementing a direct algorithm w/ point-to-point vs just mapping the oob allgather to mpir/mpid_iallgather?
- Do you view UCC as something you need in addition to hcoll, or could it potentially serve as a replacement?
I have also asked @raffenet and @hzhou to take a look at the code, since they have more experience at the device layer.
Like I mentioned previously, we are happy to set up an online call to discuss more. Next week is a shortened holiday week here in the US, so folks may be on vacation. Does the week after next (July 7-11) work well for you? I am available Monday, July 7 and Wednesday, July 9 in the morning US time/afternoon CET.
Hi @mjwilkins18, Thank you very much for your feedback; as promised, here are my answers to your questions:
I saw you rely on MPI_COMM_WORLD for setup, which I agree with for a first implementation. In the future, how important do you think it is for your customers/apps to support UCC within the Sessions model?
That's a very good point! Since we at ParTec are currently working on malleability features based on MPI sessions, this is definitely an important issue for us that we still need to solve for a complete integration. For the moment, however, it was only important for us to see what a working UCC integration might look like, and for that, relying on the world model was easier and sufficient as a first draft.
What is the benefit of implementing a direct algorithm w/ point-to-point vs just mapping the oob allgather to mpir/mpid_iallgather?
Hmm, I guess there is no benefit, and it would most probably be better to follow your suggestion and use the iallgather from the MPIR layer here -- I have to admit that, for a quick draft implementation, we simply followed the pattern used by Open MPI.
Do you view UCC as something you need in addition to hcoll, or could it potentially serve as a replacement?
No, as I understand it, hcoll will become (or has already become) obsolete and will be replaced by UCC -- and I currently don't expect any of our customers to object to that transition.
I'm looking forward to talking to you offline about this further today!
Opened #7578 to further address this.