[Feature] Add `DeleteWorkersOnFailure` deletion policy for RayJob
Search before asking
- [x] I had searched in the issues and found no similar feature requirement.
Description
DeleteWorkersOnFailure: Deletes workers only when the Ray job fails and deletes the entire RayCluster when the Ray job succeeds. This seems to be a more common pattern for users.
Should we add this policy or rename DeleteWorkers to DeleteWorkersOnFailure? Does it need to be in v1.3.0?
Use case
No response
Related issues
No response
Are you willing to submit a PR?
- [ ] Yes I am willing to submit a PR!
cc @andrewsykim any thoughts?
Having both policies probably makes sense. I'm in favor of a new policy like DeleteWorkersOnFailure. Would it be too verbose to name it something like DeleteClusterOnSuccessOrWorkersOnFailure?
DeleteWorkersOnFailure is probably fine as long as the deletion policy on success is well documented in the API reference or user documentation.
On second thought, I realized that users may have more combinations. For example,
- DeleteCluster on success, DeleteNone on failure
- DeleteSelf on success, DeleteNone on failure
- DeleteCluster on success, DeleteWorkers on failure
- DeleteSelf on success, DeleteWorkers on failure
There are two solutions:
- Keep the current API, but add new policies like DeleteClusterOnSuccessOrWorkersOnFailure if needed.
- Separate the deletion API into OnFailureDeletionPolicy and OnSuccessDeletionPolicy.
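To make the trade-off between the two solutions concrete, here is a minimal Go sketch that enumerates the combinations listed above. The type and constant names are hypothetical and only mirror the policy names used in this thread; they are not the KubeRay API. With two independent fields, every pairing is expressible without a dedicated combined name.

```go
package main

import "fmt"

// DeletionPolicy is a hypothetical type for this sketch, not the KubeRay type.
type DeletionPolicy string

// Policy names as they appear in this thread.
const (
	DeleteCluster DeletionPolicy = "DeleteCluster"
	DeleteSelf    DeletionPolicy = "DeleteSelf"
	DeleteWorkers DeletionPolicy = "DeleteWorkers"
	DeleteNone    DeletionPolicy = "DeleteNone"
)

// Pair holds one on-success / on-failure combination.
type Pair struct {
	OnSuccess DeletionPolicy
	OnFailure DeletionPolicy
}

// Combinations returns every pairing of the given on-success and
// on-failure policies, in nested-loop order.
func Combinations(onSuccess, onFailure []DeletionPolicy) []Pair {
	pairs := make([]Pair, 0, len(onSuccess)*len(onFailure))
	for _, s := range onSuccess {
		for _, f := range onFailure {
			pairs = append(pairs, Pair{OnSuccess: s, OnFailure: f})
		}
	}
	return pairs
}

func main() {
	pairs := Combinations(
		[]DeletionPolicy{DeleteCluster, DeleteSelf},
		[]DeletionPolicy{DeleteNone, DeleteWorkers},
	)
	for _, p := range pairs {
		fmt.Printf("%s on success, %s on failure\n", p.OnSuccess, p.OnFailure)
	}
}
```

Under the first solution each of these pairs would need its own enum value; under the second, the set grows by adding values to either field.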
Marking this issue as v1.3.0 because we need to make a decision about the API before the release.
These are really good considerations. Since we put the feature behind an alpha feature gate, I feel fine about breaking the API in v1.4 if needed.
We can consider an API like this as well:
spec:
  deletionPolicy:
    onSuccess: DeleteCluster
    onFailure: DeleteWorkers
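As a rough illustration of how the proposed shape could map to Go types, here is a minimal sketch. The struct, field, and function names are assumptions made for this example and are not taken from the KubeRay codebase; only the `onSuccess` and `onFailure` keys and the policy values come from the YAML above.

```go
package main

import "fmt"

// DeletionPolicyType is a hypothetical enum for this sketch.
type DeletionPolicyType string

const (
	DeleteCluster DeletionPolicyType = "DeleteCluster"
	DeleteWorkers DeletionPolicyType = "DeleteWorkers"
)

// DeletionPolicy mirrors the proposed spec.deletionPolicy block.
// The JSON tags match the keys in the YAML example.
type DeletionPolicy struct {
	OnSuccess DeletionPolicyType `json:"onSuccess,omitempty"`
	OnFailure DeletionPolicyType `json:"onFailure,omitempty"`
}

// PolicyFor picks the policy that applies to a finished job.
// succeeded is true when the Ray job completed successfully.
func PolicyFor(p DeletionPolicy, succeeded bool) DeletionPolicyType {
	if succeeded {
		return p.OnSuccess
	}
	return p.OnFailure
}

func main() {
	p := DeletionPolicy{OnSuccess: DeleteCluster, OnFailure: DeleteWorkers}
	fmt.Println("on success:", PolicyFor(p, true))
	fmt.Println("on failure:", PolicyFor(p, false))
}
```

This sketch leaves open what an unset field should mean; that default would need to be decided and documented as part of the API change.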
OK, let's update the API in v1.4.0.