[Feature] Support JobDeploymentStatus as the deletion condition
Why are these changes needed?
The current deletionStrategy relies exclusively on the terminal states of JobStatus (SUCCEEDED or FAILED). However, there are several scenarios in which a user-deployed RayJob ends up with JobStatus == "" (JobStatusNew) while JobDeploymentStatus == "Failed". In these cases, the associated resources (e.g., RayJob, RayCluster, etc.) remain stuck and are never cleaned up, resulting in indefinite resource consumption.
Changes
- Add the JobDeploymentStatus field to DeletionCondition
  - Currently supports Failed only
- Enforce mutual exclusivity between JobStatus and JobDeploymentStatus within DeletionCondition
Implementation Details
To determine which field the user specifies, we use pointers instead of raw values. Both JobStatus and JobDeploymentStatus have empty strings as their zero values, which correspond to a "new" state. Using nil allows us to reliably distinguish between "unspecified" and "explicitly set," avoiding unintended ambiguity.
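As a rough illustration of the pointer approach described above, the sketch below models a condition with two optional fields, checks that they are not both set, and matches the "JobStatus is new but JobDeploymentStatus is Failed" scenario. The types and the helpers validateCondition and matches are simplified, hypothetical stand-ins and not the actual KubeRay definitions; rejecting a condition with neither field set is also an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

// Simplified stand-ins for the API types; the real KubeRay definitions may differ.
type JobStatus string
type JobDeploymentStatus string

const (
	JobStatusNew     JobStatus           = "" // the zero value corresponds to a "new" state
	JobStatusFailed  JobStatus           = "FAILED"
	DeploymentFailed JobDeploymentStatus = "Failed"
)

// DeletionCondition uses pointers so that nil ("unspecified") can be told
// apart from "" ("explicitly set to the new state").
type DeletionCondition struct {
	JobStatus           *JobStatus
	JobDeploymentStatus *JobDeploymentStatus
	TTLSeconds          int32
}

// validateCondition (hypothetical helper) enforces mutual exclusivity.
// Requiring at least one field is an assumption made for this sketch.
func validateCondition(c DeletionCondition) error {
	hasJob := c.JobStatus != nil
	hasDeploy := c.JobDeploymentStatus != nil
	switch {
	case hasJob && hasDeploy:
		return errors.New("jobStatus and jobDeploymentStatus are mutually exclusive")
	case !hasJob && !hasDeploy:
		return errors.New("one of jobStatus or jobDeploymentStatus must be set")
	}
	return nil
}

// matches (hypothetical helper) compares whichever field the user specified
// against the current status of the job.
func matches(c DeletionCondition, js JobStatus, ds JobDeploymentStatus) bool {
	if c.JobStatus != nil {
		return *c.JobStatus == js
	}
	if c.JobDeploymentStatus != nil {
		return *c.JobDeploymentStatus == ds
	}
	return false
}

func main() {
	failed := DeploymentFailed
	cond := DeletionCondition{JobDeploymentStatus: &failed, TTLSeconds: 10}
	fmt.Println(validateCondition(cond) == nil)
	// JobStatus is still "new", but the deployment has failed.
	fmt.Println(matches(cond, JobStatusNew, DeploymentFailed))
}
```

With value fields instead of pointers, a condition that only sets jobDeploymentStatus would be indistinguishable from one that also sets jobStatus to the empty "new" state.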
Related issue number
Closes https://github.com/ray-project/kuberay/issues/4233.
Checks
- [x] I've made sure the tests are passing.
- Testing Strategy
  - [x] Unit tests
  - [x] Manual tests
  - [ ] This PR is not tested :(
The helm lint is failing.
Will fix after getting off work, thanks for reviewing!
cc @seanlaii and @win5923 for help. Note that we need to wait until @andrewsykim is back to discuss the API change.
Hi @JiangJiaWei1103, can you also update the comment to mention that JobDeploymentStatus is also supported?
https://github.com/ray-project/kuberay/blob/e32405e7a852c20bfdaa4ebe65897b544be8d9e5/ray-operator/config/samples/ray-job.deletion-rules.yaml#L12-L22
Hi @win5923, nice suggestion. I'm considering adding one more sample demonstrating JobDeploymentStatus-based deletion rules, wdyt?
Thanks! Overall LGTM. Only some reminders:
- We might need to either move the e2e tests for DeletionStrategy to a separate action in the CI pipeline or increase the timeout for the e2e tests, as they exceed the current 40-minute timeout. cc @rueian
- It might be good to clarify that we evaluate the rules in order, so if a user specifies a different deletionPolicy for a similar status, the first deletionRule will be used, for example:

deletionRules:
- condition:
    jobStatus: FAILED
    ttlSeconds: 30
  policy: DeleteWorkers
- condition:
    jobDeploymentStatus: FAILED
    ttlSeconds: 30
  policy: DeleteCluster

Thanks!
Thanks for reviewing! For the first one, let's wait for rueian's reply.
As for the second, since both rules match the corresponding status, they will both be added to overdueRules. selectMostImpactfulRule then prioritizes the most impactful rule (DeleteCluster 3 > DeleteWorkers 2 in this case), so I think DeleteCluster will be executed first. The following demonstrates an example:
apiVersion: ray.io/v1
kind: RayJob
metadata:
  namespace: default
  name: demo-del-rules
spec:
  submissionMode: "K8sJobMode"
  entrypoint: "python -c 'import sys, time; time.sleep(45); sys.exit(1)'"
  deletionStrategy:
    deletionRules:
    - condition:
        jobStatus: FAILED
        ttlSeconds: 10
      policy: DeleteWorkers
    - condition:
        jobDeploymentStatus: Failed
        ttlSeconds: 10
      policy: DeleteCluster
  # ...
The following logs show that the most impactful policy, DeleteCluster, is executed:
{"level":"info","ts":"2025-12-14T09:42:29.001+0800","logger":"controllers.RayJob","msg":"Executing the most impactful overdue deletion rule","RayJob":{"name":"del-seq","namespace":"default"},"reconcileID":"a9595cd2-df62-4291-b8b7-3e47a772ae6e","deletionMechanism":"DeletionRules","rule":{"policy":"DeleteCluster","condition":{"jobDeploymentStatus":"Failed","ttlSeconds":10}},"overdueRulesCount":2}
{"level":"info","ts":"2025-12-14T09:42:29.001+0800","logger":"controllers.RayJob","msg":"Executing deletion policy: DeleteCluster","RayJob":{"name":"del-seq","namespace":"default"},"reconcileID":"a9595cd2-df62-4291-b8b7-3e47a772ae6e","RayCluster":"del-seq-gcz64"}
{"level":"info","ts":"2025-12-14T09:42:29.013+0800","logger":"controllers.RayJob","msg":"The associated RayCluster for RayJob is deleted","RayJob":{"name":"del-seq","namespace":"default"},"reconcileID":"a9595cd2-df62-4291-b8b7-3e47a772ae6e","RayCluster":{"name":"del-seq-gcz64","namespace":"default"}}
{"level":"info","ts":"2025-12-14T09:42:29.013+0800","logger":"controllers.RayJob","msg":"deleteClusterResources","RayJob":{"name":"del-seq","namespace":"default"},"reconcileID":"a9595cd2-df62-4291-b8b7-3e47a772ae6e","isClusterDeleted":false}
{"level":"info","ts":"2025-12-14T09:42:29.013+0800","logger":"controllers.RayJob","msg":"All applicable deletion rules have been processed.","RayJob":{"name":"del-seq","namespace":"default"},"reconcileID":"a9595cd2-df62-4291-b8b7-3e47a772ae6e","deletionMechanism":"DeletionRules"}
{"level":"info","ts":"2025-12-14T09:42:29.015+0800","logger":"controllers.RayJob","msg":"RayJob","RayJob":{"name":"del-seq","namespace":"default"},"reconcileID":"7ea2a0cf-1f04-4978-b819-ff7f4784b50e","JobStatus":"FAILED","JobDeploymentStatus":"Failed","SubmissionMode":"K8sJobMode"}
{"level":"info","ts":"2025-12-14T09:42:29.015+0800","logger":"controllers.RayJob","msg":"Skipping completed deletion rule","RayJob":{"name":"del-seq","namespace":"default"},"reconcileID":"7ea2a0cf-1f04-4978-b819-ff7f4784b50e","deletionMechanism":"DeletionRules","rule":{"policy":"DeleteWorkers","condition":{"jobStatus":"FAILED","ttlSeconds":10}}}
{"level":"info","ts":"2025-12-14T09:42:29.015+0800","logger":"controllers.RayJob","msg":"Skipping completed deletion rule","RayJob":{"name":"del-seq","namespace":"default"},"reconcileID":"7ea2a0cf-1f04-4978-b819-ff7f4784b50e","deletionMechanism":"DeletionRules","rule":{"policy":"DeleteCluster","condition":{"jobDeploymentStatus":"Failed","ttlSeconds":10}}}
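The selection step can be sketched roughly as follows. This is a minimal illustration, not the actual selectMostImpactfulRule code: the Rule type, the impact map, and the mostImpactful helper are hypothetical, and only the relative order DeleteCluster (3) > DeleteWorkers (2) is taken from the discussion above.

```go
package main

import "fmt"

type DeletionPolicy string

const (
	DeleteWorkers DeletionPolicy = "DeleteWorkers"
	DeleteCluster DeletionPolicy = "DeleteCluster"
)

// impact is a hypothetical priority table; only the relative order
// DeleteCluster (3) > DeleteWorkers (2) comes from the thread.
var impact = map[DeletionPolicy]int{
	DeleteWorkers: 2,
	DeleteCluster: 3,
}

// Rule is a simplified stand-in for a deletion rule.
type Rule struct {
	Policy     DeletionPolicy
	TTLSeconds int32
}

// mostImpactful returns the overdue rule whose policy has the highest
// priority, regardless of its position in the list. The boolean is false
// when there are no overdue rules.
func mostImpactful(overdue []Rule) (Rule, bool) {
	if len(overdue) == 0 {
		return Rule{}, false
	}
	best := overdue[0]
	for _, r := range overdue[1:] {
		if impact[r.Policy] > impact[best.Policy] {
			best = r
		}
	}
	return best, true
}

func main() {
	// Both rules matched and their TTLs expired, as in the example above.
	overdue := []Rule{
		{Policy: DeleteWorkers, TTLSeconds: 10},
		{Policy: DeleteCluster, TTLSeconds: 10},
	}
	if r, ok := mostImpactful(overdue); ok {
		fmt.Println(r.Policy)
	}
}
```

Under these assumptions, the order in which the rules are listed does not affect which policy is chosen once both are overdue.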
If I'm mistaken, please let me know. Thanks a lot!
> As for the second, since both rules match the corresponding status, they will be added to overdueRules. selectMostImpactfulRule then prioritizes the most important rule (DeleteCluster 3 > DeleteWorkers 2 in this case) to execute, so I think DeleteCluster will be executed first.
You are right. My mistake. Forgot that it will first fetch all the rules that are matched. Thanks for the explanation!