[Feature] Add ability to modify the image of a worker group in RayCluster with rolling upgrade or restart
Search before asking
- [x] I had searched in the issues and found no similar feature requirement.
Description
When the image of a worker group is updated in the RayCluster definition, KubeRay should restart the worker group or perform a rolling update.
Use case
When the Docker image of a worker group is updated, the group currently keeps running with the old image until the cluster is restarted or re-created. Right now I just re-create the cluster after each image version update.
Related issues
#2534
Are you willing to submit a PR?
- [x] Yes I am willing to submit a PR!
We also experimented with scaling to zero and back, but that may lead to version discrepancies between worker groups, so in our case it's safer to just kill everything.
@pkit what's your expectation for the head pod? Do you ever expect KubeRay to recreate it if you update the image even if it means losing the GCS state?
I don't care much about the head pod. I manage all actors from the application anyway, so if it restarts, that's OK. Obviously it would be nice if it re-created the actors and was fault tolerant, but that's just "nice to have".
Can you try:
- set the worker group's suspend to true in the RayCluster CR
- update the image
- set the worker group's suspend back to false in the RayCluster CR
We should also add a doc for this.
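The suspend, update image, unsuspend sequence suggested above can be sketched as three JSON Patch (RFC 6902) documents. This is only an illustration under assumptions: the field paths `spec.workerGroupSpecs[i].suspend` and `spec.workerGroupSpecs[i].template.spec.containers[j].image` are assumed from the RayCluster CR layout, the worker-group `suspend` field may be gated or unsupported depending on the KubeRay version, and the function names and image tag are hypothetical.

```python
import json


def suspend_patch(group_index: int, suspended: bool) -> list[dict]:
    """JSON Patch that sets the worker group's suspend flag.

    "add" is used because, for an object member, it creates the field if it
    is missing and replaces the value if it already exists.
    """
    return [{
        "op": "add",
        "path": f"/spec/workerGroupSpecs/{group_index}/suspend",
        "value": suspended,
    }]


def image_patch(group_index: int, image: str, container_index: int = 0) -> list[dict]:
    """JSON Patch that replaces the image of one container in the group's pod template."""
    return [{
        "op": "replace",
        "path": (
            f"/spec/workerGroupSpecs/{group_index}"
            f"/template/spec/containers/{container_index}/image"
        ),
        "value": image,
    }]


def restart_with_new_image(group_index: int, image: str) -> list[list[dict]]:
    """Return the three patches in the order they would be applied."""
    return [
        suspend_patch(group_index, True),   # 1. suspend the worker group
        image_patch(group_index, image),    # 2. update the image
        suspend_patch(group_index, False),  # 3. resume the worker group
    ]


if __name__ == "__main__":
    # Hypothetical image name, for illustration only.
    for step in restart_with_new_image(0, "example.com/my-ray-worker:v2"):
        print(json.dumps(step))
```

Each printed patch could then be applied with something like `kubectl patch raycluster <name> --type=json -p '<patch>'`. It is probably worth waiting for the old worker pods to be deleted before resuming the group.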
@pkit: @win5923 is working on this in #4185.
@kevin85421 looks good!
@pkit I want to make sure you understand the limitation of #4185. To upgrade the Ray version, the Ray head Pod must be recreated, and all running jobs in the cluster will fail. Users need to drain the cluster before upgrading. Is this the behavior you are looking for?
@pkit yes, I'm aware of that limitation. We can re-create the cluster on a Ray upgrade, but that happens much less frequently than image updates.