kuberay

[Feature] Allow submitting a RayJob while some worker nodes are pending

Open loleek opened this issue 2 years ago • 4 comments

Search before asking

  • [X] I had searched in the issues and found no similar feature requirement.

Description

When a RayJob builds a large-scale Ray cluster, k8s may run out of resources, which leaves some worker nodes pending. In this case the Ray cluster can already run the job, but the RayJob still considers the cluster to be initializing, so the job waits until the cluster is healthy. Can we add an available-replicas value to indicate that the cluster can accept a job submission even if its size has not yet reached the expected replicas, instead of just hanging the job?
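A minimal sketch of the check being requested, in Go, assuming a simplified view of cluster state. `ClusterSnapshot`, `ReadyToSubmit`, and the `minAvailable` threshold are hypothetical names used only for illustration; they are not KubeRay's actual types or fields.

```go
package main

import "fmt"

// ClusterSnapshot is a hypothetical, simplified view of a Ray cluster's state.
// It is NOT KubeRay's real status type; the field names are assumptions.
type ClusterSnapshot struct {
	HeadReady        bool  // the head pod is running and ready
	DesiredWorkers   int32 // workers requested by the cluster spec
	AvailableWorkers int32 // workers that are actually running and ready
}

// ReadyToSubmit reports whether a job could be submitted to a partially
// provisioned cluster: the head must be ready and at least minAvailable
// workers must be available. minAvailable is a hypothetical threshold.
func ReadyToSubmit(c ClusterSnapshot, minAvailable int32) bool {
	if !c.HeadReady {
		return false
	}
	// Never require more workers than the cluster is asking for.
	if minAvailable > c.DesiredWorkers {
		minAvailable = c.DesiredWorkers
	}
	return c.AvailableWorkers >= minAvailable
}

func main() {
	// 60 of 100 workers are up; the remaining 40 are pending.
	c := ClusterSnapshot{HeadReady: true, DesiredWorkers: 100, AvailableWorkers: 60}
	fmt.Println(ReadyToSubmit(c, 50))  // threshold below the available count
	fmt.Println(ReadyToSubmit(c, 100)) // threshold equal to the full size
}
```

With a threshold equal to the desired size, this reduces to the current wait-until-fully-provisioned behavior, so the threshold would only relax the existing check.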

Use case

Submit the job as soon as possible instead of letting it hang, to improve resource usage in k8s.

Related issues

No response

Are you willing to submit a PR?

  • [X] Yes I am willing to submit a PR!

loleek, Sep 06 '22 11:09

cc @harryge00 @Jeffwan I'm personally a bit behind on the status of RayJob support.

DmitriGekhtman, Sep 07 '22 04:09

It does seem reasonable to allow a job to get started before resources are fully provisioned.

DmitriGekhtman, Sep 14 '22 17:09

I wonder also if a tool like https://github.com/IBM/multi-cluster-app-dispatcher could help with ensuring that resources are available before submitting a job.

DmitriGekhtman, Sep 14 '22 17:09

It does seem reasonable to allow a job to get started before resources are fully provisioned.

The Ray cluster's WorkerGroupSpec already has MinReplicas and MaxReplicas fields. This indicates that a Ray cluster can run with an elastic size for a single job.
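As a rough sketch of how MinReplicas could serve as the submission threshold, the Go snippet below lists the worker groups that are still below their minimum. The `WorkerGroup` struct is a simplified stand-in for illustration, not KubeRay's actual WorkerGroupSpec, and `GroupsBelowMin` and the `available` map are hypothetical.

```go
package main

import "fmt"

// WorkerGroup is a simplified, hypothetical stand-in for a worker group spec.
// Only the fields needed for this illustration are modeled.
type WorkerGroup struct {
	Name        string
	MinReplicas int32
	MaxReplicas int32
}

// GroupsBelowMin returns the names of groups whose available worker count is
// still below MinReplicas. An empty result means every group has reached its
// minimum, so a job could start even if MaxReplicas has not been reached.
// The available map (group name -> ready workers) is an assumed input.
func GroupsBelowMin(groups []WorkerGroup, available map[string]int32) []string {
	var short []string
	for _, g := range groups {
		// A group missing from the map counts as having zero workers.
		if available[g.Name] < g.MinReplicas {
			short = append(short, g.Name)
		}
	}
	return short
}

func main() {
	groups := []WorkerGroup{
		{Name: "cpu", MinReplicas: 2, MaxReplicas: 10},
		{Name: "gpu", MinReplicas: 1, MaxReplicas: 4},
	}
	available := map[string]int32{"cpu": 3, "gpu": 0}
	fmt.Println(GroupsBelowMin(groups, available))
}
```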

loleek, Sep 15 '22 03:09

This can be solved by #1631. Close this one.

kevin85421, Jun 29 '24 23:06