aibrix
Support for LWS cluster instead of RayClusterFleet
🚀 Feature Description and Motivation
I'm proposing that aibrix support a cluster resource based on LWS (LeaderWorkerSet) and start the Ray cluster within it using startup scripts, as described in the LWS docs. A scheduler could then enable topological placement of worker pods on nodes based on NVLink/NUMA alignment. This would help us achieve higher performance by enabling features such as GPU-to-GPU communication over NVLink during distributed inference, and it would also make it easier to replace Ray with another framework in the future. This would be a substitute for RayClusterFleet.
Use Case
Deploy distributed inference serving while ensuring that the worker pods of the Ray clusters are placed on NVLink/NUMA-aligned VMs, so that GPU-to-GPU communication over RDMA is possible.
Proposed Solution
Add a cluster resource based on LWS (LeaderWorkerSet) and start the Ray cluster within it using startup scripts, as described in the LWS docs. A topology-aware scheduler can then place worker pods on nodes based on NVLink/NUMA alignment, improving performance through GPU-to-GPU communication over NVLink during distributed inference. This resource would be a substitute for RayClusterFleet and would also leave room to replace Ray with another framework later.
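To make the proposal concrete, here is a minimal sketch of what such an LWS resource could look like, modeled on the multi-node vLLM pattern in the LWS docs. The image name, model, and GPU counts are placeholders, not part of the original proposal; `LWS_LEADER_ADDRESS` is the environment variable LWS injects into pods of a group:

```yaml
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: distributed-inference
spec:
  replicas: 1                 # number of leader/worker groups (Ray clusters)
  leaderWorkerTemplate:
    size: 2                   # 1 leader + 1 worker pod per group
    leaderTemplate:
      spec:
        containers:
        - name: leader
          image: <your-inference-image>     # placeholder
          command: ["/bin/sh", "-c"]
          args:
          # Start the Ray head, then launch the inference server on top of it.
          - ray start --head --port=6379;
            python3 -m vllm.entrypoints.openai.api_server
            --model <your-model> --tensor-parallel-size 16
          resources:
            limits:
              nvidia.com/gpu: "8"
    workerTemplate:
      spec:
        containers:
        - name: worker
          image: <your-inference-image>     # placeholder
          command: ["/bin/sh", "-c"]
          args:
          # Join the leader's Ray cluster via the LWS-injected address.
          - ray start --address=$(LWS_LEADER_ADDRESS):6379 --block
          resources:
            limits:
              nvidia.com/gpu: "8"
```

Because the group is an ordinary set of pods, a topology-aware scheduler can act on it at scheduling time, which is the point of the proposal.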
@vivekskrishna Hi, thanks for your feedback. There are two issues we can discuss here.
- Even using RayClusterFleet, communication is already optimal here. RDMA is used by NCCL; Ray plays no role in communication, only process and rank orchestration. If you would like to enable topological placement, it should be done at the time the pods are scheduled, and I don't see a difference between RayClusterFleet and LWS there. Could I know which scheduler you are using? Maybe we can extend the support there.
- We do plan to open-source our internal cloud-native solution, but it is still an LWS equivalent. We don't use a StatefulSet underneath; we manage the pods ourselves, which gives more flexibility. Let me know if that works for you. I think in your case you just want to remove the overhead of Ray.
Hi @Jeffwan, in our case we want to make sure that when the pods for a Ray cluster are placed, they satisfy a topological constraint so that we can place them all as close together as possible. We plan to use LWS plus Volcano or Kueue for this, which will help us build a topology map and then place the pods based on the constraint. Support for the Volcano scheduler is being added to LWS for this purpose: https://github.com/kubernetes-sigs/lws/issues/497 (this also enables Volcano's topological placement).
If there is another way to place the pods of a Ray cluster with the same topological constraints (similar to how Volcano does it), we would be glad to try it out and provide feedback.
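For reference, even without Volcano, plain Kubernetes pod affinity can pack one group's pods into the same topology domain, assuming nodes carry a suitable label. The label key `topology.example.com/nvlink-domain` below is hypothetical (it depends on how nodes are labeled in the cluster); `leaderworkerset.sigs.k8s.io/name` is the label LWS puts on all pods of a LeaderWorkerSet:

```yaml
# Pod template fragment: ask the scheduler to co-locate all pods of
# one LWS group on nodes sharing the same (hypothetical) NVLink/NUMA
# domain label.
affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          leaderworkerset.sigs.k8s.io/name: distributed-inference
      topologyKey: topology.example.com/nvlink-domain
```

This only expresses "same domain"; it cannot rank domains or reason about a multi-level topology, which is what Volcano's topology-aware scheduling adds.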
We are also exploring how to eventually avoid using Ray during distributed inference.
@vivekskrishna we will provide a cloud-native solution soon. Its main purpose is to cover P/D pooling and xPyD mode, and it will also cover the general TP or PP multi-host case. I will keep you posted and involve you in that review. Directly using LWS is another option, but the StatefulSet approach is not that easy to extend to support more complex orchestration. We will leave the door open for now.
great. thank you @Jeffwan
@vivekskrishna can you check this solution? https://aibrix.readthedocs.io/latest/designs/aibrix-stormservice.html We will use this controller to manage multi-node and P/D cases in the future.
StormService is the primary orchestration offering in aibrix. It can be used to support P/D and take over the role of RayClusterFleet as well. We will not adopt LWS at this moment. I will close this issue; feel free to give more feedback if anything is missing in StormService.