kuberay
[Feature] Allow submitting a RayJob while some worker nodes are pending
Search before asking
- [X] I had searched in the issues and found no similar feature requirement.
Description
When building a large-scale Ray cluster in a RayJob, Kubernetes may run out of resources, which leaves some worker nodes pending. In this case the Ray cluster can already run jobs, but the RayJob still treats it as initializing and waits until the cluster is healthy. Can we add an available-replicas field to indicate that the cluster can accept job submissions even if it has not yet reached the expected number of replicas, rather than leaving the job hanging?
Use case
Submit the job as soon as possible instead of leaving it hanging, to improve resource usage in Kubernetes.
Related issues
No response
Are you willing to submit a PR?
- [X] Yes I am willing to submit a PR!
cc @harryge00 @Jeffwan I'm personally a bit behind on the status of RayJob support.
It does seem reasonable to allow a job to get started before resources are fully provisioned.
I wonder also if a tool like https://github.com/IBM/multi-cluster-app-dispatcher could help with ensuring that resources are available before submitting a job.
> It does seem reasonable to allow a job to get started before resources are fully provisioned.
The Ray cluster's WorkerGroupSpec now has MinReplicas and MaxReplicas fields. This indicates that a Ray cluster can run at an elastic size for a single job.
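To make the idea concrete, here is a minimal sketch of a readiness check that compares ready workers against MinReplicas instead of the full desired count. The `workerGroup` type, its fields, and `canSubmitJob` are hypothetical names for illustration only. They are not KubeRay's actual types or the logic the operator uses, and the policy itself (head ready plus at least MinReplicas ready workers per group) is an assumption.

```go
package main

// workerGroup is a hypothetical, simplified stand-in for a worker group's
// replica settings and observed state; it is not KubeRay's actual type.
type workerGroup struct {
	Name            string
	DesiredReplicas int32 // replicas requested for the group
	MinReplicas     int32 // smallest size the group may run at
	ReadyReplicas   int32 // worker pods currently running and ready
}

// canSubmitJob reports whether a job could be submitted under the assumed
// policy: the head must be ready, and every worker group needs at least
// MinReplicas ready workers (capped at DesiredReplicas).
func canSubmitJob(headReady bool, groups []workerGroup) bool {
	if !headReady {
		return false
	}
	for _, g := range groups {
		required := g.MinReplicas
		if g.DesiredReplicas < required {
			required = g.DesiredReplicas
		}
		if g.ReadyReplicas < required {
			return false
		}
	}
	return true
}
```

Under this assumed policy, a group that wants 10 workers with MinReplicas set to 4 would accept a job once 4 workers are ready, even while the remaining pods are still pending.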
This can be solved by #1631. Closing this one.