kuberay
[Feature] Integrate KubeRay with YuniKorn
Search before asking
- [X] I had searched in the issues and found no similar feature requirement.
Description
Currently, with the YuniKorn Admission Controller, YuniKorn is able to schedule Ray jobs on K8s.
But without the Admission Controller, if YuniKorn is not up and running (e.g. during a deploy or after a crash), jobs that do not specify a batchScheduler are scheduled by the default scheduler. This is not acceptable for production jobs, because we set resource allocations for each queue and rely on other YuniKorn features.
YuniKorn Gang Scheduling is already enabled for Spark jobs; we will add the related configuration in KubeRay to support task groups for Ray workers.
This feature will include
- Integrate YuniKorn as batchScheduler for Ray jobs
- Support Gang Scheduling for Ray jobs using YuniKorn
cc @kevin85421
Use case
Use YuniKorn to schedule Ray jobs in production EKS clusters. Enable Gang Scheduling so that GPU nodes are placed in one task group and CPU nodes in another. Enable preemption and queue features (maxApp, guaranteed resources, and max resources for CPU/GPU).
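To make the use case concrete, here is a rough sketch of the pod-level settings that Gang Scheduling with separate GPU and CPU task groups might need. It only builds plain dictionaries; it is not KubeRay code and does not reflect how the integration will actually be implemented. The scheduler name `yunikorn`, the `applicationId` and `queue` labels, and the `yunikorn.apache.org/task-group-name` and `yunikorn.apache.org/task-groups` annotations follow YuniKorn's conventions as I understand them and should be checked against the YuniKorn docs. The helper names, the queue `root.ray`, the application ID, and all resource figures are made up for illustration.

```python
import json

# Assumed YuniKorn annotation keys; verify against the YuniKorn documentation.
TASK_GROUP_NAME = "yunikorn.apache.org/task-group-name"
TASK_GROUPS = "yunikorn.apache.org/task-groups"


def task_group(name, min_member, min_resource):
    """Describe one gang: min_member pods, each needing min_resource."""
    return {"name": name, "minMember": min_member, "minResource": min_resource}


def worker_pod_fragment(app_id, queue, group_name, task_groups):
    """Hypothetical helper: the metadata/spec fields a Ray pod would carry."""
    return {
        "metadata": {
            "labels": {"applicationId": app_id, "queue": queue},
            "annotations": {
                # Which gang this particular pod belongs to.
                TASK_GROUP_NAME: group_name,
                # The full list of gangs, serialized as a JSON string.
                TASK_GROUPS: json.dumps(task_groups),
            },
        },
        # Route the pod to YuniKorn instead of the default scheduler.
        "spec": {"schedulerName": "yunikorn"},
    }


# Illustrative values only: one group for GPU workers, one for CPU workers.
groups = [
    task_group("gpu-workers", 2, {"cpu": "4", "memory": "16Gi", "nvidia.com/gpu": "1"}),
    task_group("cpu-workers", 4, {"cpu": "2", "memory": "8Gi"}),
]

gpu_pod = worker_pod_fragment("ray-job-0001", "root.ray", "gpu-workers", groups)
cpu_pod = worker_pod_fragment("ray-job-0001", "root.ray", "cpu-workers", groups)
```

Setting `schedulerName` explicitly on every pod is what avoids the fallback to the default scheduler described above. Where these fields would come from (RayCluster labels, annotations, or something else) is left open here.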
Related issues
YuniKorn side changes are tracked here: https://issues.apache.org/jira/browse/YUNIKORN-1907
Are you willing to submit a PR?
- [X] Yes I am willing to submit a PR!
Thanks, @lixmgl! Some tips to make the integration possible:
- We aim to maintain KubeRay's lightweight nature. Therefore, please minimize the addition of YuniKorn-specific code to the KubeRay core.
- Please avoid modifying the CRD as much as possible.
We are very much interested in this feature. Thank you, @lixmgl, for bringing this up!
@lixmgl are you still working on this? I have an initial PR for this, which I can share if you don't mind. Please let me know, thanks!