ray [ADAG] Refactor nccl to communicator channel.

Why are these changes needed?

This is try to enable the ADAG channel can use different hardware while the user API keeps the same. For user side, they can use transport='nccl' or transport='hccl' for different hardware. While internally, ADAG will treat them the nodes that needs the communicator. So for the complied and channel level, it just rename the nccl to communicator. In the bottom level, the nccl_group or hccl_group will be called to achieve the hardware level communicator. The API and logical above the torch_tensor_communicator_channel.py previously torch_tensor_nccl_channel.py keeps the same.

So when new accelerator will be used, what they need is just implement xccl_group, with same API for all groups. Then we can use new hardware.

RFC doc https://docs.google.com/document/d/1zu9SllrEAjPHqs-eeITtrSSbv0rBxtkyCJeweZJl100/edit?usp=sharing

Related issue number

Checks

[√] I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
[√] I've run scripts/format.sh to lint the changes in this PR.
[ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.
[√] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- [x] Unit tests
- [ ] Release tests
- [ ] This PR is not tested :(

Sep 27 '24 19:09 Bye-legumes

@ruisearch42 @rkooo567 Can you take a look on this and also https://github.com/ray-project/ray/pull/47658. Thx! Currently I just keep the same api at the top level so we can use transport = ''hccl or transport = ''nccl while for future hardware we just need to implement xccl_goup

Oct 09 '24 17:10 Bye-legumes

Hi @Bye-legumes , this changes quite a bit of the interface. Perhaps it's better to set up a meeting with @rkooo567 and me to discuss about your high level plans and timelines for NPU support. Can we set up something next week?

Also cc @stephanie-wang @anyscalesam in case you have any thoughts on this change and future ones.

Oct 10 '24 19:10 ruisearch42

Hi @Bye-legumes , this changes quite a bit of the interface. Perhaps it's better to set up a meeting with @rkooo567 and me to discuss about your high level plans and timelines for NPU support. Can we set up something next week?

Also cc @stephanie-wang @anyscalesam in case you have any thoughts on this change and future ones.

Sure, you can set up a meeting next week, I am available all the time!

Oct 10 '24 19:10 Bye-legumes

Can you update the PR description with the changes included? If it is mainly renaming internal APIs from nccl -> communicator, that's probably okay.

For adding a new transport, the preferred method for now would probably be to pass a custom communicator during compilation, which is being worked on in #47540.

Oct 10 '24 20:10 stephanie-wang

Can you update the PR description with the changes included? If it is mainly renaming internal APIs from nccl -> communicator, that's probably okay.

For adding a new transport, the preferred method for now would probably be to pass a custom communicator during compilation, which is being worked on in #47540.

Thx for your reply! , it;s just some renaming internal APIs from nccl -> communicator and also some hardware checking during complied stage. As if we need new Hardware to use xccl, the compile will always check GPU currently. So that is my motivation so for future hardware, we just need to implement the xccl_group and add hardware checking at compile stages.

Oct 10 '24 20:10 Bye-legumes

Thx for your reply! , it;s just some renaming internal APIs from nccl -> communicator and also some hardware checking during complied stage. As if we need new Hardware to use xccl, the compile will always check GPU currently. So that is my motivation so for future hardware, we just need to implement the xccl_group and add hardware checking at compile stages.

I see, thanks! Does the PR include the change for hardware checking?

I think in general we can accept the API internal renaming right away, but as @ruisearch42 said, it would be best if we can discuss the long-term plan for xccl support. One question I have is whether we need to add "xccl" as a possible transport type or if just using/extending the custom communicator interface would be sufficient. Could you put together a short RFC doc on this so the we can discuss with the OSS community?

Oct 12 '24 00:10 stephanie-wang

Thx for your reply! , it;s just some renaming internal APIs from nccl -> communicator and also some hardware checking during complied stage. As if we need new Hardware to use xccl, the compile will always check GPU currently. So that is my motivation so for future hardware, we just need to implement the xccl_group and add hardware checking at compile stages.

I see, thanks! Does the PR include the change for hardware checking?

I think in general we can accept the API internal renaming right away, but as @ruisearch42 said, it would be best if we can discuss the long-term plan for xccl support. One question I have is whether we need to add "xccl" as a possible transport type or if just using/extending the custom communicator interface would be sufficient. Could you put together a short RFC doc on this so the we can discuss with the OSS community?

Yeah, add some resource checking at channel level
I am not sure we use "xccl" directly at higher level or not(Of course I think we can achieve this by checking the resource, then make decision about which xccl should be use. Currently I just keep the transport at higher level to be the same that we specify "hccl" or "nccl" then do the resource check.
To make extension of different HW, I think the level above the python/ray/experimental/channel/torch_tensor_communicator_channel.py can keeps the same logic (just rename requires_nccl to requires_communicator) then we just keep use same API for different xccl_goup (init, send, receive, destory). I think that's enough. That is : 1. Keep logic unchanged for the codes above python/ray/experimental/channel/torch_tensor_communicator_channel.py 2. Same API for xccl_goup. To add new resource, we need to add the resource check at chennel python/ray/experimental/channel/torch_tensor_type.py then implement a xccl_group that with same API.

Oct 15 '24 18:10 Bye-legumes

propose another PR later

Nov 06 '24 19:11 Bye-legumes