[Feature] Support batch scheduling and queueing
Search before asking
- [X] I had searched in the issues and found no similar feature requirement.
Description
KubeRay currently does not seem to support scheduling policies for Ray clusters. Examples include:
- Queue scheduling
- Gang scheduling
- Job preemption
Possible scheduler implementations include https://volcano.sh/en/ and https://github.com/kubernetes-sigs/kueue.
Use case
Example use cases:
- Multiple users try to deploy Ray clusters in the same Kubernetes cluster. Not all of the requests can be granted due to resource constraints. It would be nice to have some requests stored in a queue and processed at a later time.
- Clusters can have configurable retry policies
- Clusters can be annotated with priority classes
- Batch scheduling configurations: a Ray cluster only gets deployed if a minimum quorum of nodes can be scheduled (see the sketch below).
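For illustration, this kind of quorum requirement is what gang schedulers express natively. A minimal sketch using Volcano's PodGroup API, with hypothetical names and counts; the group is only admitted once all of its pods can be placed:

```yaml
# Hypothetical PodGroup for a Ray cluster with 1 head and 9 workers.
# Volcano admits the group only when all 10 pods fit in the cluster.
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: ray-cluster-group
spec:
  minMember: 10   # no pod starts unless all 10 can be scheduled together
  queue: default  # queue the group is submitted to
```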
Related issues
No response
Are you willing to submit a PR?
- [X] Yes I am willing to submit a PR!
That's an interesting use case.
What constitutes a resource constraint? If a cluster needs 10 replicas (worker pods) but only 9 can be scheduled, is that a resource constraint? Or does it only count as one if zero pods for a RayCluster can be scheduled?
Currently the RayOperator has a reconciliation loop that tries to schedule any pods that are missing in a RayCluster. It fetches the list of RayClusters from K8s periodically.
Autoscaling (Cluster-Autoscaling) with K8s can be used to add more nodes when the Ray pods are in pending state due to lack of resources.
Having the Ray Operator keep a queue would make it stateful, and it is preferable to keep the state in K8s.
As for batch scheduling configuration, I think it is a very valid feature request: if the user wants a RayCluster with 100 workers and gets a cluster with 1 head node and zero workers, that might not be useful at all.
However adding this feature would mean that the Ray Operator should keep track of the available resources in the K8s Cluster, which would add to its complexity and the latency of processing each request to create a RayCluster.
Thanks for the reply.
For the case with 100 workers - suppose that different users created two such clusters in the same Kubernetes cluster. Neither of them has sufficient workers, but both would still be competing for the same resources in the pool. This can lead to resource starvation.
With gang scheduling support, resource utilization would be more efficient, since a cluster would only be admitted once it can be scheduled in full.
Would it be possible to add an option to integrate with an existing batch scheduler, like volcano.sh?
I see what you mean.
If we use a scheduling framework like Volcano, we would introduce a dependency on the Volcano API (`apiVersion: scheduling.volcano.sh`).
Moreover, part of deploying KubeRay would involve deploying Volcano.
Do you have in mind a design where KubeRay users can choose whether or not to take on this Volcano dependency? In other words, would gang scheduling be optional?
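To make the dependency concrete, here is roughly what a Volcano Queue resource looks like (standard Volcano API; the name and limits are illustrative):

```yaml
# Illustrative Volcano Queue: deploying KubeRay with gang scheduling
# would require CRDs like this to be installed in the cluster.
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: kuberay-test-queue
spec:
  weight: 1        # relative share of cluster resources
  capability:
    cpu: 4         # hard cap on CPU this queue may consume
```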
Today we have the option of using minReplicas. Does it satisfy your use case if the RayCluster pods are only created when the minimum number of pods can be created?
```yaml
replicas: {{ .Values.worker.replicas }}
minReplicas: {{ .Values.worker.miniReplicas | default 1 }}
maxReplicas: {{ .Values.worker.maxiReplicas | default 2147483647 }}
```
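For context, these Helm values render into the worker group spec of the RayCluster CR, roughly like this (group name and counts are illustrative):

```yaml
# Illustrative rendered worker group in a RayCluster CR.
workerGroupSpecs:
  - groupName: worker-group
    replicas: 3     # desired number of workers
    minReplicas: 1  # lower bound respected by the autoscaler
    maxReplicas: 5  # upper bound respected by the autoscaler
```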
FYI, we have noticed that Volcano does not work with Kubernetes versions >= 1.22 out of the box.
Here is demo code that adds Volcano support to the Ray operator: https://github.com/loleek/kuberay/commit/745e12cb6b72d4299ec1b4bef65cb46d305eafc3. I read the queue and priority class from the head pod spec, and it works well for me.
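For reference, reading the queue and priority class from the head pod spec presumably relies on standard pod fields plus Volcano's group annotation, roughly like this (names are illustrative; the linked commit may differ in detail):

```yaml
# Illustrative head pod fragment: standard Kubernetes fields plus the
# annotation Volcano uses to associate pods with a PodGroup.
metadata:
  annotations:
    scheduling.k8s.io/group-name: ray-cluster-group
spec:
  schedulerName: volcano           # hand the pod to Volcano instead of the default scheduler
  priorityClassName: high-priority # standard Kubernetes priority class
```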
@loleek thanks for sharing!
I think there are several tools in the k8s ecosystem that can help with this. For example, @asm582 has demo'd and documented how to use MCAD to gang-schedule KubeRay-managed resources https://ray-project.github.io/kuberay/guidance/kuberay-with-MCAD/
cc @sihanwang41 @architkulkarni @kevin85421
In my opinion, batch scheduling and queuing should not be directly supported out-of-the-box. However, clear docs on the use of external tools like https://ray-project.github.io/kuberay/guidance/kuberay-with-MCAD/ are much appreciated!
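For readers following that link: MCAD queues the resources to be gang-dispatched inside an AppWrapper CR, along these lines (a sketch based on the MCAD guide; field details may differ by version):

```yaml
# Illustrative MCAD AppWrapper: the wrapped RayCluster is held in a
# queue and only created once its aggregate resource demand fits.
apiVersion: mcad.ibm.com/v1beta1
kind: AppWrapper
metadata:
  name: raycluster-quota
spec:
  resources:
    GenericItems:
      - replicas: 1
        generictemplate:
          apiVersion: ray.io/v1alpha1
          kind: RayCluster
          metadata:
            name: raycluster-example
          # ... full RayCluster spec goes here ...
```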
@loleek from a quick read of the branch you posted, it seems to me the same functionality could be achieved with the existing operator code, by correctly configuring a RayCluster CR.
Of course, having the functionality built-in is certainly nice and might be better for your use-case. But then on the other hand, we don't want to be too opinionated on what external schedulers to use.
It'd be great if we can figure out how to support using Volcano without modifying the operator code and then document that for the community.
Configuring a RayCluster using minReplicas might not solve problems when stateful Ray actors are involved under preemption.
For example, job x requires 10 actors, 1 actor per CPU. Job y also requires 10 actors, 1 actor per CPU. There are 15 CPUs in total. If job y preempts job x, half of job x's actors are lost and wait for retry.
Retrying the actors a few times might always fail, because it is hard to predict job y's running time, and retrying infinitely is not a good practice either. It would be better to let the gang scheduler remove job x entirely and reschedule it after job y finishes.
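For concreteness, the preemption in this example would typically be driven by Kubernetes priority classes, e.g. (illustrative name and value):

```yaml
# Illustrative PriorityClass: pods of job y carrying this class can
# preempt lower-priority pods of job x when CPUs run out.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: job-y-priority
value: 1000
preemptionPolicy: PreemptLowerPriority
globalDefault: false
```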
@wolvever that's a great point, Ray doesn't have built-in gang scheduling and fault tolerance primitives that would mitigate this scenario. While Ray does have some built-in fault tolerance mechanisms, we generally recommend thinking about fault tolerance at the application level, i.e. you should consider various failure scenarios and design your Ray app with those in mind.
I think this issue focuses on gang-scheduling at the level of K8s pods, though, rather than at the level of the Ray internals.
Note that https://github.com/ray-project/kuberay/pull/755 has landed, which adds support for Volcano. The interface has been designed in such a way to make it possible to add new schedulers without a major refactor.
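With that PR, using Volcano is opt-in: the operator is started with batch scheduling enabled, and a RayCluster selects the scheduler via labels, roughly as follows (consult the KubeRay Volcano docs for the current flags and label names):

```yaml
# Illustrative RayCluster using the Volcano integration from #755;
# assumes the operator runs with --enable-batch-scheduler.
apiVersion: ray.io/v1alpha1
kind: RayCluster
metadata:
  name: test-cluster
  labels:
    ray.io/scheduler-name: volcano             # delegate scheduling to Volcano
    volcano.sh/queue-name: kuberay-test-queue  # optional: target Volcano queue
spec:
  # ... head group and worker group specs ...
```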
MCAD is also an option for batch scheduling functionality. https://www.anyscale.com/blog/gang-scheduling-ray-clusters-on-kubernetes-with-multi-cluster-app-dispatcher
Seems this is good to close. Feel free to open further discussion on batch scheduling and queueing!
@DmitriGekhtman would MCAD work for long-lived clusters? I.e., say we have a Ray cluster and we submit a job to the cluster. This job tells the autoscaler to spawn 5 worker pods. Can MCAD be used, in Ray's current state, to handle this type of scenario? The original issue talks about queuing the instantiation of new Ray clusters.