kuberay
[Feature] Make RayJob recover automatically from K8S submitter job and Ray cluster head node failures
Search before asking
- [X] I had searched in the issues and found no similar feature requirement.
Description
I propose that restarts of the K8S submitter pod and of the Ray cluster head be recovered from automatically, either by recreating everything or by making the K8S submitter pod able to cope with its own restarts and with Ray head restarts.
The proposal applies to K8sJobMode.
Current behaviour is as follows:
- Restart of the K8S submitter job pod
  - If the backoff limit is reached, the submitter job completes and the Ray cluster continues running the Ray job. The Ray cluster is deleted if that is configured in the RayJob CR.
  - If the backoff limit is not reached, the Ray job is submitted again and fails because the submission ID already exists; the submitter is then restarted until the backoff limit is reached.
- Restart of the Ray head pod
  - If the Ray cluster dashboard API recovers before the submitter K8S job reaches its backoff limit, the Ray job is resubmitted and the RayJob recovers.
  - If the Ray dashboard does not recover in time, the job fails.
Use case
I would like to create RayJob entities which are able to recover automatically from node failures, emergency nodepool upgrades and scheduled maintenance operations.
Related issues
No response
Are you willing to submit a PR?
- [X] Yes I am willing to submit a PR!
A minimal enhancement would be to provide a way to control the K8S submitter job backoff limit. If I could set it higher than 2, I could write a submitterPodTemplate which:
- waits until the RayCluster is in the ready state
- waits until the dashboard is alive
- checks whether a job with the submission ID is already running
  - if yes, starts tailing the logs
  - if not, (re)creates the job
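
To make the steps above concrete, here is a rough Python sketch of what such a submitter entrypoint could look like. It is only an illustration under assumptions: `SubmitterClient` and its methods (`cluster_ready`, `dashboard_alive`, `job_exists`, `submit`, `tail_logs`) are hypothetical names, not KubeRay or Ray APIs, and a real implementation would have to back them with the Kubernetes API and the Ray dashboard. Tailing the logs after a fresh submission is also my assumption, so that the submitter pod stays alive for the lifetime of the job.

```python
import time
from typing import Callable, Iterable, Protocol


class SubmitterClient(Protocol):
    """Hypothetical interface for illustration; not a KubeRay or Ray API."""

    def cluster_ready(self) -> bool: ...
    def dashboard_alive(self) -> bool: ...
    def job_exists(self, submission_id: str) -> bool: ...
    def submit(self, submission_id: str, entrypoint: str) -> None: ...
    def tail_logs(self, submission_id: str) -> Iterable[str]: ...


def wait_until(check: Callable[[], bool], timeout_s: float, poll_s: float,
               sleep: Callable[[float], None] = time.sleep) -> None:
    """Poll `check` until it returns True, or raise TimeoutError."""
    deadline = time.monotonic() + timeout_s
    while not check():
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within timeout")
        sleep(poll_s)


def submit_or_attach(client: SubmitterClient, submission_id: str,
                     entrypoint: str, timeout_s: float = 600.0,
                     poll_s: float = 5.0,
                     sleep: Callable[[float], None] = time.sleep) -> str:
    """Attach to an existing job, or submit it if it is not there yet.

    Returns "attached" or "submitted" so callers can tell which path ran.
    """
    # Steps 1 and 2: wait for the cluster, then for the dashboard.
    wait_until(client.cluster_ready, timeout_s, poll_s, sleep)
    wait_until(client.dashboard_alive, timeout_s, poll_s, sleep)

    # Step 3: only submit when no job with this submission ID exists,
    # which avoids the "submission ID exists" failure after a restart.
    if client.job_exists(submission_id):
        action = "attached"
    else:
        client.submit(submission_id, entrypoint)
        action = "submitted"

    # Assumption: follow the logs in both cases until the stream ends.
    for line in client.tail_logs(submission_id):
        print(line)
    return action
```

With logic like this, a restarted submitter pod would re-attach to the job that is already running instead of failing on the duplicate submission ID, and a restarted head would be waited for rather than consuming retries, as long as the backoff limit and timeouts are generous enough.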
There is a good discussion of a retry policy API for RayJob in https://github.com/ray-project/kuberay/pull/2091#discussion_r1573470644