kuberay
[Feature] Finalizer to block deletion of RayCluster with running jobs
Search before asking
- [X] I had searched in the issues and found no similar feature requirement.
Description
I would like to introduce a finalizer that can be used with RayCluster to block deletion until all jobs in the Ray cluster are completed.
Use case
This feature would allow you to issue a delete for a Ray cluster while jobs are still running. The finalizer ensures, by querying the Ray head service, that all jobs have completed before resources are cleaned up. This is handy when you want resources cleaned up automatically as soon as a long-running training job finishes, and it matters even more for large jobs where resources need to be released as quickly as possible to save costs.
This can also be used as a safety measure to ensure RayClusters with running jobs can't be accidentally deleted.
While RayJob can be used for similar use-cases, it is not a viable option for longer-lived RayClusters that can accept multiple jobs before being deleted.
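To make the mechanism concrete, below is a minimal sketch of the kind of check the operator (or any external controller) could run against the Ray head before allowing cleanup, using the Ray Jobs SDK. The head service address and the treatment of STOPPED/SUCCEEDED/FAILED as terminal states are assumptions for illustration; nothing here is existing KubeRay behavior.

```python
from ray.job_submission import JobSubmissionClient, JobStatus

# Hypothetical head service address; KubeRay typically exposes the
# Ray dashboard on <cluster-name>-head-svc:8265.
HEAD_ADDRESS = "http://my-cluster-head-svc:8265"

# Jobs in any of these states no longer need the cluster's resources.
TERMINAL_STATES = {JobStatus.STOPPED, JobStatus.SUCCEEDED, JobStatus.FAILED}


def all_jobs_finished(address: str = HEAD_ADDRESS) -> bool:
    """Return True if every job known to the Ray head has reached a terminal state."""
    client = JobSubmissionClient(address)
    return all(job.status in TERMINAL_STATES for job in client.list_jobs())


if __name__ == "__main__":
    if all_jobs_finished():
        print("All jobs finished; the finalizer could be removed and cleanup can proceed.")
    else:
        print("Jobs still running; deletion should stay blocked.")
```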
Related issues
No response
Are you willing to submit a PR?
- [X] Yes I am willing to submit a PR!
Thoughts @kevin85421 ?
The suspend feature in RayJob will issue a request to the Ray head Pod to halt the job before the RayCluster is deleted. For RayCluster, I prefer to avoid doing too many things on the data plane (i.e. Ray). If users want to suspend a RayCluster, they should make sure all jobs are stopped by themselves.
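For comparison, "making sure all jobs are stopped by themselves" on the user side could look roughly like the snippet below, again using the Ray Jobs SDK against the head service. The address is an assumption, and the loop simply stops every job that has not yet reached a terminal state.

```python
from ray.job_submission import JobSubmissionClient, JobStatus

# Hypothetical head service address (Ray dashboard port 8265).
client = JobSubmissionClient("http://my-cluster-head-svc:8265")

TERMINAL_STATES = {JobStatus.STOPPED, JobStatus.SUCCEEDED, JobStatus.FAILED}

# Stop every job that is still pending or running before deleting the RayCluster.
for job in client.list_jobs():
    if job.status not in TERMINAL_STATES and job.submission_id:
        client.stop_job(job.submission_id)
```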
Btw, this is pertaining to deletion, not suspension.
@kevin85421 here's the use-case I am thinking about:
- Data Scientist starts their day by creating a RayCluster. The cluster is large and consumes expensive hardware accelerators.
- Throughout the day they run several jobs on their cluster. They submit multiple jobs interactively, which is why RayJob is not a viable option.
- Near the end of the day, they want to run one more job that will take several hours to complete and they want to check the results the next day.
- Because their cluster is very expensive, they want the cluster to be automatically deleted after the final job completes but they don't want to babysit the job until completion.
- They add the finalizer `ray.io/wait-for-job-completion` to the RayCluster and then run the delete command `kubectl delete raycluster my-cluster` (a scripted version of these two steps is sketched after this list).
- kuberay-operator sees the finalizer and waits for all jobs to complete before cleaning up all resources.
- Data Scientist checks the results the next day and spins up a new RayCluster to start development again.
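As a rough sketch of those two user-side steps (adding the finalizer, then requesting deletion) from a script, here is one way to do it with the official Kubernetes Python client. The group/version, namespace, and cluster name are assumptions, and the finalizer name is simply the one proposed in this issue; it does not exist in KubeRay today.

```python
from kubernetes import client, config

# Assumed identifiers; the finalizer string is the one proposed in this issue.
FINALIZER = "ray.io/wait-for-job-completion"
GROUP, VERSION, PLURAL = "ray.io", "v1", "rayclusters"
NAMESPACE, NAME = "default", "my-cluster"

config.load_kube_config()
api = client.CustomObjectsApi()

# Read the RayCluster and append the finalizer if it is not already present.
raycluster = api.get_namespaced_custom_object(GROUP, VERSION, NAMESPACE, PLURAL, NAME)
finalizers = raycluster["metadata"].get("finalizers", [])
if FINALIZER not in finalizers:
    api.patch_namespaced_custom_object(
        GROUP, VERSION, NAMESPACE, PLURAL, NAME,
        body={"metadata": {"finalizers": finalizers + [FINALIZER]}},
    )

# Request deletion; the RayCluster stays in Terminating until the operator
# (per this proposal) sees all jobs complete and removes the finalizer.
api.delete_namespaced_custom_object(GROUP, VERSION, NAMESPACE, PLURAL, NAME)
```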
Note that the finalizer would be optional, and blocking deletion until job completion would not be the default behavior. I agree with your previous comment that we don't need to cover this for suspension.
We have a similar use-case:
- The RayCluster is shared by several tenants, who submit jobs to it.
- When the RayCluster is being deleted, Pods should not be removed before all jobs have completed.
- New job submissions would be forbidden while deletion is pending.