
[Feature] Support JobDeploymentStatus as the deletion condition

Open JiangJiaWei1103 opened this issue 1 month ago • 6 comments

Why are these changes needed?

The current deletionStrategy relies exclusively on the terminal states of JobStatus (SUCCEEDED or FAILED). However, there are several scenarios in which a user-deployed RayJob ends up with JobStatus == "" (JobStatusNew) while JobDeploymentStatus == "Failed". In these cases, the associated resources (e.g., RayJob, RayCluster, etc.) remain stuck and are never cleaned up, resulting in indefinite resource consumption.

Changes

  • Add the JobDeploymentStatus field to DeletionCondition
    • Currently supports Failed only
  • Enforce mutual exclusivity between JobStatus and JobDeploymentStatus within DeletionCondition

Implementation Details

To determine which field the user specifies, we use pointers instead of raw values. Both JobStatus and JobDeploymentStatus have empty strings as their zero values, which correspond to a "new" state. Using nil allows us to reliably distinguish between "unspecified" and "explicitly set," avoiding unintended ambiguity.
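
A minimal Go sketch of the pointer approach and the mutual-exclusivity check. The types and constants are simplified stand-ins for illustration, not the actual KubeRay API definitions, and `validateCondition` and `ptr` are hypothetical helpers. The rule that at least one field must be set is an assumption added here; the description above only states that the two fields are mutually exclusive.

```go
package main

import (
	"errors"
	"fmt"
)

// Simplified stand-ins for the API types (assumption: the real definitions
// in the ray-operator API package may differ).
type JobStatus string
type JobDeploymentStatus string

const (
	JobStatusNew    JobStatus = "" // the zero value means "new", not "unset"
	JobStatusFailed JobStatus = "FAILED"

	JobDeploymentStatusFailed JobDeploymentStatus = "Failed"
)

// DeletionCondition uses pointers so that nil means "not specified by the
// user", which an empty string cannot express.
type DeletionCondition struct {
	JobStatus           *JobStatus
	JobDeploymentStatus *JobDeploymentStatus
	TTLSeconds          int32
}

// validateCondition (hypothetical helper) rejects conditions that set both
// fields. Rejecting conditions that set neither is an assumption.
func validateCondition(c DeletionCondition) error {
	hasJob := c.JobStatus != nil
	hasDeployment := c.JobDeploymentStatus != nil
	switch {
	case hasJob && hasDeployment:
		return errors.New("jobStatus and jobDeploymentStatus are mutually exclusive")
	case !hasJob && !hasDeployment:
		return errors.New("one of jobStatus or jobDeploymentStatus must be set")
	}
	return nil
}

// ptr returns a pointer to v; handy for building conditions from constants.
func ptr[T any](v T) *T { return &v }

func main() {
	onlyDeployment := DeletionCondition{
		JobDeploymentStatus: ptr(JobDeploymentStatusFailed),
		TTLSeconds:          30,
	}
	both := DeletionCondition{
		JobStatus:           ptr(JobStatusFailed),
		JobDeploymentStatus: ptr(JobDeploymentStatusFailed),
		TTLSeconds:          30,
	}
	fmt.Println("only jobDeploymentStatus:", validateCondition(onlyDeployment))
	fmt.Println("both fields:", validateCondition(both))
}
```

With plain string fields, a condition that leaves `JobStatus` unset would be indistinguishable from one that targets `JobStatusNew`, which is the ambiguity the pointers avoid.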

Related issue number

Closes https://github.com/ray-project/kuberay/issues/4233.

Checks

  • [x] I've made sure the tests are passing.
  • Testing Strategy
    • [x] Unit tests
    • [x] Manual tests
    • [ ] This PR is not tested :(

JiangJiaWei1103 avatar Dec 07 '25 23:12 JiangJiaWei1103

The helm lint is failing.

rueian avatar Dec 08 '25 00:12 rueian

The helm lint is failing.

Will fix after getting off work, thanks for reviewing!

pickymodel avatar Dec 08 '25 00:12 pickymodel

cc @seanlaii and @win5923 for help. Note that we need to wait until @andrewsykim is back to discuss the API change.

Future-Outlier avatar Dec 08 '25 10:12 Future-Outlier

Hi @JiangJiaWei1103, can you also update the comment to mention that JobDeploymentStatus is supported as well?

https://github.com/ray-project/kuberay/blob/e32405e7a852c20bfdaa4ebe65897b544be8d9e5/ray-operator/config/samples/ray-job.deletion-rules.yaml#L12-L22

Hi @win5923, nice suggestion. I'm considering adding one more sample demonstrating JobDeploymentStatus-based deletion rules, wdyt?

JiangJiaWei1103 avatar Dec 09 '25 11:12 JiangJiaWei1103

Thanks! Overall LGTM. Only some reminders:

  1. We might need to either move the e2e tests for DeletionStrategy to a separate action in the CI pipeline or increase the timeout for the e2e tests, because they exceed the current 40-minute timeout. cc @rueian
  2. It might be good to clarify that we evaluate the rules in order, so if a user specifies a different deletionPolicy for a similar status, the first deletionRule will be used, for example:
deletionRules:
    - condition:
        jobStatus: FAILED
        ttlSeconds: 30
      policy: DeleteWorkers
    - condition:
        jobDeploymentStatus: FAILED
        ttlSeconds: 30
      policy: DeleteCluster

Thanks!

Thanks for reviewing! For the first one, let's wait for rueian's reply.

As for the second, since both rules match their corresponding status, they will both be added to overdueRules. selectMostImpactfulRule then prioritizes the most impactful rule (DeleteCluster 3 > DeleteWorkers 2 in this case), so I think DeleteCluster will be executed first. The following example demonstrates this:

apiVersion: ray.io/v1
kind: RayJob
metadata:
  namespace: default
  name: demo-del-rules
spec:
  submissionMode: "K8sJobMode"
  entrypoint: "python -c 'import sys, time; time.sleep(45); sys.exit(1)'"

  deletionStrategy:
    deletionRules:
    - condition:
        jobStatus: FAILED
        ttlSeconds: 10
      policy: DeleteWorkers
    - condition:
        jobDeploymentStatus: Failed
        ttlSeconds: 10
      policy: DeleteCluster

    # ...

The following logs show that the most impactful policy, DeleteCluster, is executed:

{"level":"info","ts":"2025-12-14T09:42:29.001+0800","logger":"controllers.RayJob","msg":"Executing the most impactful overdue deletion rule","RayJob":{"name":"del-seq","namespace":"default"},"reconcileID":"a9595cd2-df62-4291-b8b7-3e47a772ae6e","deletionMechanism":"DeletionRules","rule":{"policy":"DeleteCluster","condition":{"jobDeploymentStatus":"Failed","ttlSeconds":10}},"overdueRulesCount":2}
{"level":"info","ts":"2025-12-14T09:42:29.001+0800","logger":"controllers.RayJob","msg":"Executing deletion policy: DeleteCluster","RayJob":{"name":"del-seq","namespace":"default"},"reconcileID":"a9595cd2-df62-4291-b8b7-3e47a772ae6e","RayCluster":"del-seq-gcz64"}
{"level":"info","ts":"2025-12-14T09:42:29.013+0800","logger":"controllers.RayJob","msg":"The associated RayCluster for RayJob is deleted","RayJob":{"name":"del-seq","namespace":"default"},"reconcileID":"a9595cd2-df62-4291-b8b7-3e47a772ae6e","RayCluster":{"name":"del-seq-gcz64","namespace":"default"}}
{"level":"info","ts":"2025-12-14T09:42:29.013+0800","logger":"controllers.RayJob","msg":"deleteClusterResources","RayJob":{"name":"del-seq","namespace":"default"},"reconcileID":"a9595cd2-df62-4291-b8b7-3e47a772ae6e","isClusterDeleted":false}
{"level":"info","ts":"2025-12-14T09:42:29.013+0800","logger":"controllers.RayJob","msg":"All applicable deletion rules have been processed.","RayJob":{"name":"del-seq","namespace":"default"},"reconcileID":"a9595cd2-df62-4291-b8b7-3e47a772ae6e","deletionMechanism":"DeletionRules"}
{"level":"info","ts":"2025-12-14T09:42:29.015+0800","logger":"controllers.RayJob","msg":"RayJob","RayJob":{"name":"del-seq","namespace":"default"},"reconcileID":"7ea2a0cf-1f04-4978-b819-ff7f4784b50e","JobStatus":"FAILED","JobDeploymentStatus":"Failed","SubmissionMode":"K8sJobMode"}
{"level":"info","ts":"2025-12-14T09:42:29.015+0800","logger":"controllers.RayJob","msg":"Skipping completed deletion rule","RayJob":{"name":"del-seq","namespace":"default"},"reconcileID":"7ea2a0cf-1f04-4978-b819-ff7f4784b50e","deletionMechanism":"DeletionRules","rule":{"policy":"DeleteWorkers","condition":{"jobStatus":"FAILED","ttlSeconds":10}}}
{"level":"info","ts":"2025-12-14T09:42:29.015+0800","logger":"controllers.RayJob","msg":"Skipping completed deletion rule","RayJob":{"name":"del-seq","namespace":"default"},"reconcileID":"7ea2a0cf-1f04-4978-b819-ff7f4784b50e","deletionMechanism":"DeletionRules","rule":{"policy":"DeleteCluster","condition":{"jobDeploymentStatus":"Failed","ttlSeconds":10}}}

If I'm mistaken, please let me know. Thanks a lot!

JiangJiaWei1103 avatar Dec 14 '25 01:12 JiangJiaWei1103

As for the second, since both rules match their corresponding status, they will both be added to overdueRules. selectMostImpactfulRule then prioritizes the most impactful rule (DeleteCluster 3 > DeleteWorkers 2 in this case) to execute, so I think DeleteCluster will be executed first.

You are right. My mistake. Forgot that it will first fetch all the rules that are matched. Thanks for the explanation!

seanlaii avatar Dec 14 '25 01:12 seanlaii