kuberay [Feature] Make RayJob recover automatically from K8S submitter job and Ray cluster head node failures


Open jrosti opened this issue 1 year ago • 4 comments

Search before asking

[X] I had searched in the issues and found no similar feature requirement.

Description

I propose that restarts of the K8S submitter pod and of the Ray cluster head are recovered from automatically: either everything gets recreated, or the K8S submitter pod is able to cope with its own restarts and with Ray head restarts.

The proposal applies to K8sJobMode.

The current behaviour is as follows:

Restart of K8S Submitter job pod

If the backoff limit is reached, the submitter job completes and the Ray cluster continues running the Ray job. The Ray cluster is deleted if that is configured in the RayJob CR.
If the backoff limit is not reached, the Ray job is submitted again and the submission fails because the submission ID already exists; the submitter pod keeps being restarted until the backoff limit is reached.

Restart of Ray head pod

If the Ray cluster dashboard API recovers before the submitter K8S job reaches its backoff limit, the Ray job is resubmitted and the RayJob recovers.
If the Ray dashboard does not recover in time, the job fails.

Use case

I would like to create RayJob resources that can recover automatically from node failures, emergency node pool upgrades and scheduled maintenance operations.

Related issues

No response

Are you willing to submit a PR?

[X] Yes I am willing to submit a PR!

Feb 04 '24 14:02 jrosti

A minimal enhancement would be to provide a way to control the backoff limit of the K8S submitter job. If I could set it higher than 2, I could write a submitterPodTemplate which:

waits until the RayCluster is in ready state
waits until the dashboard is alive
checks whether a job is running with the submission ID
- if yes, starts tailing the logs
- if not, (re)creates the job
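
The steps above could be sketched roughly as follows. This is only an illustration of the control flow, not KubeRay or Ray code: the `JobClient` interface, its method names, the `cluster_ready` callback and the status strings are all hypothetical placeholders that a real submitter would have to map onto whatever client it uses to talk to the dashboard.

```python
import time
from typing import Callable, Iterable, Optional, Protocol

# Assumption: these are the statuses that count as "job is running".
ACTIVE_STATUSES = {"PENDING", "RUNNING"}


class JobClient(Protocol):
    """Hypothetical dashboard client; not an existing Ray or KubeRay API."""

    def is_alive(self) -> bool: ...
    def job_status(self, submission_id: str) -> Optional[str]: ...
    def submit(self, entrypoint: str, submission_id: str) -> None: ...
    def tail_logs(self, submission_id: str) -> Iterable[str]: ...


def wait_until(
    check: Callable[[], bool],
    timeout_s: float,
    interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `check` until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            if check():
                return True
        except Exception:
            # A restarting head pod typically surfaces as connection errors;
            # treat any error as "not ready yet" and keep polling.
            pass
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)


def submit_or_attach(
    client: JobClient,
    entrypoint: str,
    submission_id: str,
    cluster_ready: Callable[[], bool],
    timeout_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Attach to a running job if one exists, otherwise (re)create it."""
    # 1. Wait until the RayCluster is in ready state.
    if not wait_until(cluster_ready, timeout_s, sleep=sleep):
        raise TimeoutError("RayCluster did not become ready in time")
    # 2. Wait until the dashboard is alive.
    if not wait_until(client.is_alive, timeout_s, sleep=sleep):
        raise TimeoutError("dashboard did not become reachable in time")
    # 3. Check whether a job is running with this submission ID.
    status = client.job_status(submission_id)
    if status in ACTIVE_STATUSES:
        action = "attached"
    else:
        # Not handled here: a job that already finished under the same ID
        # may cause the resubmission to be rejected.
        client.submit(entrypoint, submission_id)
        action = "submitted"
    # In both cases, follow the logs until the stream ends.
    for line in client.tail_logs(submission_id):
        print(line, end="")
    return action
```

Because a restarted submitter pod would attach to the running job instead of submitting it again, it would no longer fail on the existing submission ID, which is why a higher backoff limit would be enough to make this work.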

Feb 04 '24 14:02 jrosti

Good discussion on retry policy API for RayJob in https://github.com/ray-project/kuberay/pull/2091#discussion_r1573470644

Apr 22 '24 20:04 andrewsykim
