[Feature] Update RayJob DeletionPolicy API to differentiate success/failure scenarios
Search before asking
- [x] I had searched in the issues and found no similar feature requirement.
Description
In KubeRay v1.3 we added a DeletionPolicy API for RayJob as an alpha feature: https://github.com/ray-project/kuberay/pull/2643
The current API looks like this:
```yaml
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: ..
spec:
  deletionPolicy: DeleteCluster
  ...
```
However, we've received feedback that the deletion policy API would be more useful if it supported different deletion policies based on success or failure of the RayJob. For example, if a job succeeds, the RayCluster should always be deleted, but if it fails, we may want to preserve only the head pod for debugging. Supporting this would require a change to the current API, maybe something like the following:
```yaml
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: ..
spec:
  deletionPolicy:
    onSuccess: DeleteCluster
    onFailure: DeleteWorkers
```
Given the feature is still in alpha status (not enabled by default), we could break this API before graduating the feature to beta.
Use case
Change the deletion policy for Ray clusters created by RayJob based on the status of the job
Related issues
No response
Are you willing to submit a PR?
- [x] Yes I am willing to submit a PR!
@weizhaowz are you available to work on this?
@davidxia @kevin85421 thoughts on this change?
and RayJob.spec.ttlSecondsAfterFinished will apply to both?
@davidxia I think so; alternatively we can have TTLs per scenario like we discussed on Slack:
```yaml
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: ..
spec:
  deletionPolicy:
    onSuccess:
      policy: DeleteCluster
      ttlDuration: 1h
    onFailure:
      policy: DeleteWorkers
      ttlDuration: 7d
```
@andrewsykim that works better for me. A middle ground could be making onSuccess and onFailure objects with only a policy field, which leaves room to add more fields like ttlDuration later if more users find it useful.
```yaml
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: ..
spec:
  deletionPolicy:
    onSuccess:
      policy: DeleteCluster
    onFailure:
      policy: DeleteWorkers
```
Hi @kevin85421 @andrewsykim @davidxia @weizhaowz ,
We have a strong use case for this feature, as it would allow our users to configure a TTL for the Ray head node used for debugging.
I was wondering about the future plans for this enhancement. Based on my testing with KubeRay 1.3.2, it appears that when DeletionPolicy is used, ttlSecondsAfterFinished does not seem to affect the associated RayCluster, causing it to persist. Please let me know if my understanding is incorrect.
If adding TTL support aligns with the project's roadmap, I would be happy to contribute a PR to help move this forward.
Thank you so much for your time!
@seanlaii thanks for the feedback -- I had thought that ttlSecondsAfterFinished should still be respected when using DeletionPolicy https://github.com/ray-project/kuberay/blob/v1.3.2/ray-operator/controllers/ray/rayjob_controller.go#L363-L367. If this is not the case, we should investigate that bug separately.
We also previously discussed adding a TTL field per strategy, like this:
```yaml
spec:
  deletionPolicy:
    onSuccess:
      policy: DeleteCluster
      ttl: 10m
    onFailure:
      policy: DeleteWorkers
      ttl: 1h
```
Would that meet your use case? I believe @weizhaowz was looking into this already; I'll defer to him if he has cycles to implement it.
> I was wondering about the future plans for this enhancement. Based on my testing with KubeRay 1.3.2, it appears that when DeletionPolicy is used, ttlSecondsAfterFinished does not seem to affect the associated RayCluster, causing it to persist. Please let me know if my understanding is incorrect.
@seanlaii this feature is in alpha status in v1.3.2, so you need to enable the RayJobDeletionPolicy feature gate.
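For anyone else running into this, enabling the gate looks roughly like the following. This is a sketch assuming the kuberay-operator Helm chart's featureGates values and the operator's --feature-gates flag; double-check against the chart version you deploy:

```yaml
# values.yaml for the kuberay-operator Helm chart (assumed layout):
# each entry toggles one operator feature gate.
featureGates:
  - name: RayJobDeletionPolicy
    enabled: true
# If you manage the operator Deployment directly, the equivalent container
# argument should be: --feature-gates=RayJobDeletionPolicy=true
```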
Hi @andrewsykim, thank you for the clarification and helpful context!
Apologies for any confusion earlier.
Our primary goal is to allow users to configure a TTL for debugging the Ray Head Node — once the TTL expires, the associated RayCluster should be cleaned up automatically. This behavior is similar to setting shutdownAfterJobFinishes: true, but with an additional grace period for debugging purposes.
Based on my understanding, when using a deletion policy like DeleteWorkers in combination with ttlSecondsAfterFinished, the worker pods should be terminated after job completion, and the head node (i.e., the RayCluster) should persist temporarily and then be deleted after the TTL. This would allow users to access the head node for a limited time post-job, which aligns well with our intended use case.
To validate this, I tested the following two configurations on both KubeRay Operator v1.4.0 and v1.3.2, with the RayJobDeletionPolicy feature gate enabled:
```yaml
apiVersion: ray.io/v1
kind: RayJob
metadata:
  generateName: distributed-mnist-rayjob-
  namespace: default
spec:
  entrypoint: python ./ray-examples/ray_train_mnist.py --epochs 3 --num-workers 2
  shutdownAfterJobFinishes: true
  ttlSecondsAfterFinished: 1
  deletionStrategy:
    onSuccess:
      policy: DeleteWorkers
    onFailure:
      policy: DeleteWorkers
```
```yaml
apiVersion: ray.io/v1
kind: RayJob
metadata:
  generateName: distributed-mnist-rayjob-
  namespace: default
spec:
  entrypoint: python ./ray-examples/ray_train_mnist.py --epochs 3 --num-workers 2
  shutdownAfterJobFinishes: true
  ttlSecondsAfterFinished: 1
  deletionPolicy: DeleteWorkers
```
In both cases, the worker pods were successfully removed after the job finished, but the RayCluster remained indefinitely. I'll dig a bit deeper to see if there's an issue with how the TTL is being applied in these scenarios.
As for the proposed per-strategy TTL syntax:
```yaml
spec:
  deletionPolicy:
    onSuccess:
      policy: DeleteCluster
      ttl: 10m
    onFailure:
      policy: DeleteWorkers
      ttl: 1h
```
This would definitely provide the flexibility we need — as long as the RayCluster is cleaned up after the specified TTL, it would fully address our use case. Appreciate your time and help on this!
> In both cases, the worker pods were successfully removed after the job finished, but the RayCluster remained indefinitely. I'll dig a bit deeper to see if there's an issue with how the TTL is being applied in these scenarios.
I think this is the intended behavior. If you set the deletion policy to DeleteWorkers, we only delete the worker pods, not the RayCluster, after ttlSecondsAfterFinished. If you set the policy to DeleteCluster, we delete the RayCluster. ttlSecondsAfterFinished applies to all deletion policies, so with DeleteWorkers the workers are deleted once ttlSecondsAfterFinished has elapsed.
> The worker pods should be terminated after job completion, and the head node (i.e., the RayCluster) should persist temporarily and then be deleted after the TTL.
^ this is the part worth clarifying. The TTL is for when the deletion policy is applied, not how long to persist the cluster before full deletion. Supporting this use-case may require a different API than what we currently have.
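To make the current semantics concrete, a minimal illustration using the existing alpha fields (values picked arbitrarily):

```yaml
spec:
  ttlSecondsAfterFinished: 600    # wait 10 minutes after the job finishes...
  deletionPolicy: DeleteWorkers   # ...then delete only the worker pods.
                                  # Nothing deletes the RayCluster (head pod)
                                  # afterwards under this policy.
```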
To support the use-case you mentioned, we may need to expand the API a bit. Something like this maybe?
```yaml
deletionStrategy:
  onSuccess:
    policy: DeleteCluster
  onFailure:
    policy: DeleteWorkers
    ttl: 30m
  onTTL:
    policy: DeleteCluster
```
(just thinking out loud, open to hearing other API suggestions)
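For what it's worth, one possible reading of the failure path in that sketch (purely illustrative, since the semantics are still open):

```yaml
spec:
  deletionStrategy:
    onSuccess:
      policy: DeleteCluster   # success: clean everything up right away
    onFailure:
      policy: DeleteWorkers   # stage 1 on failure: drop the workers, keep the
      ttl: 30m                # head pod around for a 30m debugging window
    onTTL:
      policy: DeleteCluster   # stage 2: once that window expires, delete the
                              # whole RayCluster
```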
Thank you for the clarification — that makes sense now! Our use case is closer to a two-stage cleanup.
> To support the use-case you mentioned, we may need to expand the API a bit. Something like this maybe?
>
> ```yaml
> deletionStrategy:
>   onSuccess:
>     policy: DeleteCluster
>   onFailure:
>     policy: DeleteWorkers
>     ttl: 30m
>   onTTL:
>     policy: DeleteCluster
> ```
>
> (just thinking out loud, open to hearing other API suggestions)
Yes, that would meet our use case! Thanks — I think your onTTL idea is a practical and extensible direction.
@andrewsykim is this closed by https://github.com/ray-project/kuberay/pull/3731?
> @andrewsykim is this closed by #3731?
Andrew is on vacation, so I'm hopping on to answer: this feature requests "2-stage" deletion, while #3731 implemented the 1st stage, so we need to expand the API to support the 2nd stage.
> @andrewsykim is this closed by #3731?
>
> Andrew is on vacation, so I'm hopping on to answer: this feature requests "2-stage" deletion, while #3731 implemented the 1st stage, so we need to expand the API to support the 2nd stage.
Oh, I think this feature is actually closed by #3731; my earlier answer was about the "2-stage" deletion request.
@weizhaowz can you help implement the 2-stage approach? I think the API design needs more thought.
> @weizhaowz can you help implement the 2-stage approach? I think the API design needs more thought.
Yes, I can work on this feature.
@weizhaowz Is the new deletionStrategy supposed to work with clusterSelector? As of #3731, the operator deletes the cluster if the RayJob using that cluster via clusterSelector terminates with a non-zero exit code, even with onFailure: DeleteNone.
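For concreteness, the scenario being described is roughly the following (hypothetical names and entrypoint; a sketch of the report, not a tested reproduction):

```yaml
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: debug-job                           # hypothetical
spec:
  clusterSelector:
    ray.io/cluster: existing-raycluster     # submit to a pre-existing RayCluster
  entrypoint: python fail.py                # hypothetical script exiting non-zero
  deletionStrategy:
    onSuccess:
      policy: DeleteNone
    onFailure:
      policy: DeleteNone                    # expectation: leave the shared cluster alone
  # Reported behavior: the operator still deletes the RayCluster when the job fails.
```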
> @weizhaowz Is the new deletionStrategy supposed to work with clusterSelector? As of #3731, the operator deletes the cluster if the RayJob using that cluster via clusterSelector terminates with a non-zero exit code, even with onFailure: DeleteNone.
Hi, I'm not sure the operator deletes the cluster when the RayJob enables the deletion policy feature and uses a cluster selector; here is what I learned from the code.