
[Feature] Support batch scheduling and queueing

richardsliu opened this issue on Mar 23, 2022

Search before asking

  • [X] I had searched in the issues and found no similar feature requirement.

Description

KubeRay currently does not seem to support scheduling policies for Ray clusters. Examples include:

  • Queue scheduling
  • Gang scheduling
  • Job preemption

Possible scheduler implementations include https://volcano.sh/en/ and https://github.com/kubernetes-sigs/kueue.

Use case

Example use cases:

  • Multiple users try to deploy Ray clusters in the same Kubernetes cluster. Not all of the requests can be granted due to resource constraints. It would be nice to have some requests stored in a queue and processed at a later time.
  • Clusters can have configurable retry policies
  • Clusters can be annotated with priority classes
  • Batch scheduling configurations: a Ray cluster only gets deployed if a minimum number of quorum nodes can be scheduled (see the sketch below).
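
To make the last point concrete, here is a minimal sketch of how such a quorum could be expressed if scheduling were delegated to an external batch scheduler such as Volcano (illustrative only; the names, queue, and priority class below are hypothetical):

 # Hypothetical Volcano PodGroup expressing a quorum for a Ray cluster with
 # 1 head pod and 10 worker pods: no pod in the group is scheduled unless
 # all 11 can be placed at once.
 apiVersion: scheduling.volcano.sh/v1beta1
 kind: PodGroup
 metadata:
   name: ray-cluster-demo-pg
 spec:
   minMember: 11                       # 1 head + 10 workers
   queue: team-a                       # hypothetical queue for this team
   priorityClassName: high-priority    # hypothetical priority class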

Related issues

No response

Are you willing to submit a PR?

  • [X] Yes I am willing to submit a PR!

richardsliu avatar Mar 23 '22 20:03 richardsliu

That's an interesting use case.

What constitutes a resource constraint? If a cluster needs 10 replicas (worker pods) but only 9 can be scheduled, is that a resource constraint? Or does it only count when zero pods for a RayCluster can be scheduled?

Currently the Ray operator has a reconciliation loop that tries to schedule any pods that are missing from a RayCluster. It periodically fetches the list of RayClusters from K8s.

The Kubernetes Cluster Autoscaler can be used to add more nodes when Ray pods are stuck in the pending state due to a lack of resources.

Having the Ray Operator keep a queue would make it stateful, and it is preferable to keep the state in K8s.

As for batch scheduling configuration, I think it is a very valid feature request: for example, if the user wants a RayCluster with 100 workers and gets a cluster with 1 head node and zero workers, that might not be useful at all.

However, adding this feature would mean that the Ray operator has to keep track of the available resources in the K8s cluster, which would add to its complexity and to the latency of processing each request to create a RayCluster.

akanso avatar Mar 23 '22 21:03 akanso

Thanks for the reply.

For the case with 100 workers - suppose that different users created two such clusters in the same Kubernetes cluster. Neither of them has sufficient workers, but both would still be competing for the same resources in the pool. This can lead to resource starvation.

If we had support for gang scheduling, resources would be utilized much more efficiently.

Would it be possible to add an option to integrate with an existing batch scheduler, like volcano.sh?

richardsliu avatar Mar 23 '22 22:03 richardsliu

I see what you mean. If we use a scheduling framework like Volcano, we would introduce a dependency on the Volcano API (apiVersion: scheduling.volcano.sh). Moreover, deploying KubeRay would then also involve deploying Volcano.

Do you have in mind a design where KubeRay users can choose whether or not to take on this Volcano dependency? In other words, would gang scheduling be optional?

Today we have the option of using minReplicas. Would it satisfy your use case if the RayCluster pods were created only when at least the minimum number of pods can be created?

 replicas: {{ .Values.worker.replicas }}
 minReplicas: {{ .Values.worker.miniReplicas | default 1 }}
 maxReplicas: {{ .Values.worker.maxiReplicas | default 2147483647 }}
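
For reference, these Helm values end up in the worker group spec of the RayCluster CR, roughly as sketched below (field names as in the RayCluster CRD; the group name and counts are illustrative):

 # Illustrative fragment of the rendered RayCluster spec.
 workerGroupSpecs:
   - groupName: small-group    # illustrative name
     replicas: 10              # desired number of worker pods
     minReplicas: 10           # minimum number of workers for this group
     maxReplicas: 10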

akanso avatar Mar 24 '22 01:03 akanso

FYI, we have noticed that Volcano does not work out of the box with Kubernetes versions >= 1.22.

asm582 avatar Jun 02 '22 14:06 asm582

Here is some demo code that adds Volcano support to the Ray operator: https://github.com/loleek/kuberay/commit/745e12cb6b72d4299ec1b4bef65cb46d305eafc3. It gets the queue and priority class from the head pod spec, and it works well for me.
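
Roughly, the idea is that the head pod template carries the scheduling hints, for example (the annotation key and values here are only illustrative, not necessarily the exact ones my commit reads):

 # Illustrative head group template carrying the queue and priority hints.
 headGroupSpec:
   template:
     metadata:
       annotations:
         volcano.sh/queue-name: team-a      # illustrative queue annotation
     spec:
       schedulerName: volcano               # hand the pods to the Volcano scheduler
       priorityClassName: high-priority     # standard pod field used for priority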

loleek avatar Oct 28 '22 07:10 loleek

@loleek thanks for sharing!

I think there are several tools in the k8s ecosystem that can help with this. For example, @asm582 has demo'd and documented how to use MCAD to gang-schedule KubeRay-managed resources: https://ray-project.github.io/kuberay/guidance/kuberay-with-MCAD/

cc @sihanwang41 @architkulkarni @kevin85421

DmitriGekhtman avatar Oct 28 '22 15:10 DmitriGekhtman

In my opinion, batch scheduling and queuing should not be directly supported out-of-the-box. However, clear docs on the use of external tools like https://ray-project.github.io/kuberay/guidance/kuberay-with-MCAD/ are much appreciated!

@loleek from a quick read of the branch you posted, it seems to me the same functionality could be achieved with the existing operator code, by correctly configuring a RayCluster CR.

Of course, having the functionality built in is certainly nice and might be better for your use case. On the other hand, we don't want to be too opinionated about which external schedulers to use.

It'd be great if we can figure out how to support using Volcano without modifying the operator code and then document that for the community.
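
One untested idea: the RayCluster CR already lets you set arbitrary pod template fields, so you could point every pod template at the Volcano scheduler and pre-create a PodGroup by hand, roughly as sketched below (names, image, and counts are illustrative, and I haven't tried this):

 # Untested sketch: gang-schedule a RayCluster with Volcano without operator changes.
 # Step 1: pre-create a PodGroup carrying the quorum (1 head + 2 workers here).
 apiVersion: scheduling.volcano.sh/v1beta1
 kind: PodGroup
 metadata:
   name: raycluster-demo-pg
 spec:
   minMember: 3
 ---
 # Step 2: in the RayCluster CR, each pod template opts into the Volcano scheduler
 # and references the PodGroup via the group-name annotation.
 apiVersion: ray.io/v1alpha1
 kind: RayCluster
 metadata:
   name: raycluster-demo
 spec:
   headGroupSpec:
     rayStartParams: {}
     template:
       metadata:
         annotations:
           scheduling.k8s.io/group-name: raycluster-demo-pg
       spec:
         schedulerName: volcano
         containers:
           - name: ray-head
             image: rayproject/ray:2.0.0    # illustrative image
   workerGroupSpecs:
     - groupName: workers
       replicas: 2
       minReplicas: 2
       maxReplicas: 2
       rayStartParams: {}
       template:
         metadata:
           annotations:
             scheduling.k8s.io/group-name: raycluster-demo-pg
         spec:
           schedulerName: volcano
           containers:
             - name: ray-worker
               image: rayproject/ray:2.0.0  # illustrative image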

DmitriGekhtman avatar Oct 28 '22 15:10 DmitriGekhtman

Configuring a RayCluster using minReplicas might not solve the problem when stateful Ray actors are involved and preemption occurs.

For example, job x requires 10 actors, 1 actor per CPU. Job y also requires 10 actors, 1 actor per CPU. There are 15 CPUs in total. If job y preempts job x, half of job x's actors are lost and left waiting for retry.

Retrying the actors several times might always fail, because it is hard to predict job y's running time, and retrying actors infinitely is not good practice either. It is better to let a gang scheduler remove job x entirely and reschedule it after job y finishes.

wolvever avatar Oct 31 '22 06:10 wolvever

@wolvever that's a great point: Ray doesn't have built-in gang scheduling or fault tolerance primitives that would fully mitigate this scenario. While Ray does have some built-in fault tolerance mechanisms, we generally recommend thinking about fault tolerance at the application level, i.e., you should consider various failure scenarios and design your Ray app with them in mind.

I think this issue focuses on gang-scheduling at the level of K8s pods, though, rather than at the level of the Ray internals.

DmitriGekhtman avatar Oct 31 '22 19:10 DmitriGekhtman

Note that https://github.com/ray-project/kuberay/pull/755 has landed, which adds support for Volcano. The interface has been designed so that new schedulers can be added without a major refactor.
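
From memory of the PR, usage looks roughly like this (the exact flag and label names may differ slightly in the released version): start the operator with batch scheduling enabled and opt individual RayClusters into Volcano via labels on the CR.

 # Sketch, from memory of #755; verify flag/label names against the released docs.
 # 1. Start the operator with batch scheduling enabled, e.g. --enable-batch-scheduler
 # 2. Opt a RayCluster into Volcano via labels on the CR:
 apiVersion: ray.io/v1alpha1
 kind: RayCluster
 metadata:
   name: raycluster-volcano
   labels:
     ray.io/scheduler-name: volcano        # select the Volcano integration
     volcano.sh/queue-name: kuberay-test   # optional: target a specific Volcano queue
 spec:
   # head and worker group specs as usual (omitted here)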

tgaddair avatar Dec 02 '22 04:12 tgaddair

MCAD is also an option for batch scheduling functionality. https://www.anyscale.com/blog/gang-scheduling-ray-clusters-on-kubernetes-with-multi-cluster-app-dispatcher

Seems this is good to close. Feel free to open further discussion on batch scheduling and queueing!

DmitriGekhtman avatar Dec 02 '22 04:12 DmitriGekhtman

@DmitriGekhtman would MCAD work for long-lived clusters? I.e., say we have a Ray cluster and we submit a job to it, and this job tells the autoscaler to spawn 5 worker pods. Can MCAD be used, in Ray's current state, to handle this type of scenario? The original issue talks about queueing the instantiation of new Ray clusters.

peterghaddad avatar Aug 21 '23 18:08 peterghaddad