
[Feature] Update RayJob DeletionPolicy API to differentiate success/failure scenarios

Open andrewsykim opened this issue 6 months ago • 14 comments

Search before asking

  • [x] I had searched in the issues and found no similar feature requirement.

Description

In KubeRay v1.3 we added a DeletionPolicy API for RayJob as an alpha feature: https://github.com/ray-project/kuberay/pull/2643

The current API looks like this:

apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: ..
spec:
  deletionPolicy: DeleteCluster
  ...

However, we've received feedback that the deletion policy API would be more useful if it supported different deletion policies based on the success or failure of the RayJob. For example, if a job succeeds, the RayCluster should always be deleted, but if it fails, we may want to preserve only the head pod for debugging. Supporting this would require a change to the current API, maybe something like the following:

apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: ..
spec:
  deletionPolicy:
    onSuccess: DeleteCluster
    onFailure: DeleteWorkers

Given the feature is still in alpha status (not enabled by default), we could break this API before graduating the feature to beta.

Use case

Change the deletion policy for Ray clusters created by RayJob based on the status of the job

Related issues

No response

Are you willing to submit a PR?

  • [x] Yes I am willing to submit a PR!

andrewsykim avatar May 29 '25 17:05 andrewsykim

@weizhaowz are you available to work on this?

andrewsykim avatar May 29 '25 17:05 andrewsykim

@davidxia @kevin85421 thoughts on this change?

andrewsykim avatar May 29 '25 17:05 andrewsykim

@weizhaowz are you available to work on this?

Yes, I can work on it.

weizhaowz avatar May 29 '25 17:05 weizhaowz

and RayJob.spec.ttlSecondsAfterFinished will apply to both?

davidxia avatar May 29 '25 21:05 davidxia

@davidxia I think so; alternatively we can have TTLs per scenario like we discussed on Slack:

apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: ..
spec:
  deletionPolicy:
    onSuccess:
      policy: DeleteCluster
      ttlDuration: 1h
    onFailure:
      policy: DeleteWorkers
      ttlDuration: 7d

andrewsykim avatar May 29 '25 22:05 andrewsykim

@andrewsykim that works better for me. A middle ground could be making onSuccess and onFailure objects with only a policy field, which leaves room to add more fields like ttlDuration later if more users find it useful.

apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: ..
spec:
  deletionPolicy:
    onSuccess:
      policy: DeleteCluster
    onFailure:
      policy: DeleteWorkers

davidxia avatar May 30 '25 01:05 davidxia

Hi @kevin85421 @andrewsykim @davidxia @weizhaowz ,

We have a strong use case for this feature, as it would allow our users to configure a TTL for the Ray head node used for debugging.

I was wondering about the future plans for this enhancement. Based on my testing with KubeRay 1.3.2, it appears that when DeletionPolicy is used, ttlSecondsAfterFinished does not seem to affect the associated RayCluster, causing it to persist. Please let me know if my understanding is incorrect.

If adding TTL support aligns with the project's roadmap, I would be happy to contribute a PR to help move this forward.

Thank you so much for your time!

seanlaii avatar Jun 25 '25 01:06 seanlaii

@seanlaii thanks for the feedback -- I had thought that ttlSecondsAfterFinished should still be respected when using DeletionPolicy: https://github.com/ray-project/kuberay/blob/v1.3.2/ray-operator/controllers/ray/rayjob_controller.go#L363-L367. If this is not the case, we should investigate that bug separately.

We also previously discussed adding a TTL field per strategy, like this:

spec:
  deletionPolicy:
    onSuccess:
      policy: DeleteCluster
      ttl: 10m
    onFailure:
      policy: DeleteWorkers
      ttl: 1h

Would that meet your use case? I believe @weizhaowz was looking into this already; I'll defer to him if he has cycles to implement it.

andrewsykim avatar Jun 25 '25 14:06 andrewsykim
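For illustration, here is a minimal, self-contained Go sketch of the behavior described above, i.e. the configured deletion action only runs once ttlSecondsAfterFinished has elapsed since the job reached a terminal state, and the reconciler requeues otherwise. This is not the actual KubeRay controller code; the function and parameter names are made up.

package main

import (
    "fmt"
    "time"
)

// shouldApplyDeletionPolicy returns true once ttlSeconds have elapsed since
// the job finished; otherwise it returns false plus the remaining delay so
// the caller could requeue the reconcile.
func shouldApplyDeletionPolicy(jobEndTime time.Time, ttlSeconds int32, now time.Time) (bool, time.Duration) {
    ttl := time.Duration(ttlSeconds) * time.Second
    elapsed := now.Sub(jobEndTime)
    if elapsed >= ttl {
        return true, 0
    }
    return false, ttl - elapsed
}

func main() {
    // Example: the job finished 30s ago with a 60s TTL, so the deletion
    // policy is not applied yet and a requeue after ~30s is suggested.
    endTime := time.Now().Add(-30 * time.Second)
    apply, requeueAfter := shouldApplyDeletionPolicy(endTime, 60, time.Now())
    fmt.Printf("apply=%v requeueAfter=%s\n", apply, requeueAfter)
}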

I was wondering about the future plans for this enhancement. Based on my testing with KubeRay 1.3.2, it appears that when DeletionPolicy is used, ttlSecondsAfterFinished does not seem to affect the associated RayCluster, causing it to persist. Please let me know if my understanding is incorrect.

@seanlaii this feature is in alpha status in v1.3.2, so you need to enable the RayJobDeletionPolicy feature gate.

andrewsykim avatar Jun 25 '25 15:06 andrewsykim

Hi @andrewsykim, thank you for the clarification and helpful context! Apologies for any confusion earlier. Our primary goal is to allow users to configure a TTL for debugging the Ray Head Node — once the TTL expires, the associated RayCluster should be cleaned up automatically. This behavior is similar to setting shutdownAfterJobFinishes: true, but with an additional grace period for debugging purposes.

Based on my understanding, when using a deletion policy like DeleteWorkers in combination with ttlSecondsAfterFinished, the worker pods should be terminated after job completion, and the head node (i.e., the RayCluster) should persist temporarily and then be deleted after the TTL. This would allow users to access the head node for a limited time post-job, which aligns well with our intended use case.

To validate this, I tested the following two configurations on both KubeRay Operator v1.4.0 and v1.3.2, with the RayJobDeletionPolicy feature gate enabled:

apiVersion: ray.io/v1
kind: RayJob
metadata:
  generateName: distributed-mnist-rayjob-
  namespace: default
spec:
  entrypoint: python ./ray-examples/ray_train_mnist.py --epochs 3 --num-workers 2
  shutdownAfterJobFinishes: true
  ttlSecondsAfterFinished: 1
  deletionStrategy:
    onSuccess:
      policy: DeleteWorkers
    onFailure:
      policy: DeleteWorkers

apiVersion: ray.io/v1
kind: RayJob
metadata:
  generateName: distributed-mnist-rayjob-
  namespace: default
spec:
  entrypoint: python ./ray-examples/ray_train_mnist.py --epochs 3 --num-workers 2
  shutdownAfterJobFinishes: true
  ttlSecondsAfterFinished: 1
  deletionPolicy: DeleteWorkers

In both cases, the worker pods were successfully removed after the job finished, but the RayCluster remained indefinitely. I'll dig a bit deeper to see if there's an issue with how the TTL is being applied in these scenarios.

As for the proposed per-strategy TTL syntax:

spec:
  deletionPolicy:
    onSuccess:
      policy: DeleteCluster
      ttl: 10m
    onFailure:
      policy: DeleteWorkers
      ttl: 1h

This would definitely provide the flexibility we need — as long as the RayCluster is cleaned up after the specified TTL, it would fully address our use case. Appreciate your time and help on this!

seanlaii avatar Jun 25 '25 16:06 seanlaii

In both cases, the worker pods were successfully removed after the job finished, but the RayCluster remained indefinitely. I'll dig a bit deeper to see if there's an issue with how the TTL is being applied in these scenarios.

I think this is the intended behavior. If you set the deletion policy to DeleteWorkers, we only delete the worker pods, not the RayCluster, after ttlSecondsAfterFinished. If you set the policy to DeleteCluster, we delete the RayCluster. In either case, ttlSecondsAfterFinished applies to all deletion policies: it controls when the configured policy is applied.

andrewsykim avatar Jun 25 '25 16:06 andrewsykim
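To make the above concrete, a rough Go sketch (illustrative only, not the KubeRay implementation; the helper name is hypothetical) of what each policy value mentioned in this thread would clean up once the TTL has elapsed:

package main

import "fmt"

// deletionAction maps a policy value to the cleanup it performs after
// ttlSecondsAfterFinished has elapsed, per the explanation above.
func deletionAction(policy string) string {
    switch policy {
    case "DeleteCluster":
        return "delete the whole RayCluster (head and workers)"
    case "DeleteWorkers":
        return "delete only the worker pods; the head pod and RayCluster remain"
    case "DeleteNone":
        return "delete nothing"
    default:
        return "unknown policy"
    }
}

func main() {
    for _, p := range []string{"DeleteCluster", "DeleteWorkers", "DeleteNone"} {
        fmt.Printf("%s: %s\n", p, deletionAction(p))
    }
}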

The worker pods should be terminated after job completion, and the head node (i.e., the RayCluster) should persist temporarily and then be deleted after the TTL.

^ this is the part worth clarifying. The TTL is for when the deletion policy is applied, not how long to persist the cluster before full deletion. Supporting this use case may require a different API than what we currently have.

andrewsykim avatar Jun 25 '25 16:06 andrewsykim

To support the use-case you mentioned, we may need to expand the API a bit. Something like this maybe?

deletionStrategy:
  onSuccess:
    policy: DeleteCluster
  onFailure:
    policy: DeleteWorkers
    ttl: 30m
  onTTL:
    policy: DeleteCluster

(just thinking out loud, open to hear other API suggestions)

andrewsykim avatar Jun 25 '25 17:06 andrewsykim
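One way to picture the expanded API is as Go types. This is only a sketch of the shape proposed above; the field names and the TTL representation are assumptions, not the actual KubeRay API.

package main

import (
    "fmt"
    "time"
)

// DeletionRule pairs a policy value (DeleteCluster, DeleteWorkers,
// DeleteNone, ...) with an optional TTL delaying when it is applied.
type DeletionRule struct {
    Policy string
    TTL    time.Duration
}

// DeletionStrategy sketches the two-stage idea above: a first-stage rule per
// terminal state, plus an onTTL rule that finishes the cleanup later.
type DeletionStrategy struct {
    OnSuccess DeletionRule
    OnFailure DeletionRule
    OnTTL     DeletionRule
}

func main() {
    // Mirrors the example above: on failure, delete only the workers so the
    // head pod stays around for debugging, then the onTTL stage removes the
    // whole RayCluster.
    s := DeletionStrategy{
        OnSuccess: DeletionRule{Policy: "DeleteCluster"},
        OnFailure: DeletionRule{Policy: "DeleteWorkers", TTL: 30 * time.Minute},
        OnTTL:     DeletionRule{Policy: "DeleteCluster"},
    }
    fmt.Printf("%+v\n", s)
}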

Thank you for the clarification — that makes sense now! Our use case is closer to a two-stage cleanup.

To support the use-case you mentioned, we may need to expand the API a bit. Something like this maybe?

deletionStrategy:
  onSuccess:
    policy: DeleteCluster
  onFailure:
    policy: DeleteWorkers
    ttl: 30m
  onTTL:
    policy: DeleteCluster

(just thinking out loud, open to hear other API suggestions)

Yes, that would meet our use case! Thanks — I think your onTTL idea is a practical and extensible direction.

seanlaii avatar Jun 25 '25 17:06 seanlaii

@andrewsykim is this closed by https://github.com/ray-project/kuberay/pull/3731?

davidxia avatar Jul 02 '25 15:07 davidxia

@andrewsykim is this closed by #3731?

Andrew is on vacation, so I'll hop in to answer: this feature requests "2-stage" deletion while #3731 implemented the 1st stage, so we need to expand the API to support the 2nd stage.

weizhaowz avatar Jul 02 '25 16:07 weizhaowz

@andrewsykim is this closed by #3731?

Andrew is on vacation, so I'll hop in to answer: this feature requests "2-stage" deletion while #3731 implemented the 1st stage, so we need to expand the API to support the 2nd stage.

Oh, I think this feature itself is closed by #3731; my previous answer was about the "2-stage" deletion request.

weizhaowz avatar Jul 02 '25 17:07 weizhaowz

@weizhaowz can you help implement the 2-stage approach? I think the API design needs more thought.

andrewsykim avatar Jul 07 '25 17:07 andrewsykim

@weizhaowz can you help implement the 2-stage approach? I think the API design needs more thought.

Yes, I can work on this feature.

weizhaowz avatar Jul 07 '25 17:07 weizhaowz

@weizhaowz Is the new deletionStrategy supposed to work with clusterSelector? As of #3731, the operator deletes the cluster if the RayJob using that cluster via clusterSelector terminates with a non-zero exit code, even with onFailure: DeleteNone.

ldk300 avatar Jul 24 '25 09:07 ldk300

@weizhaowz Is the new deletionStrategy supposed to work with clusterSelector? As of #3731, the operator deletes the cluster if the RayJob using that cluster via clusterSelector terminates with a non-zero exit code, even with onFailure: DeleteNone.

Hi, I'm not sure the operator deletes the cluster when a RayJob enables the deletion policy feature and uses a cluster selector; here is what I learned from the code.

weizhaowz avatar Jul 30 '25 03:07 weizhaowz