[Feature] Add ability to modify the image of a worker group in RayCluster with rolling upgrade or restart

Open pkit opened this issue 4 months ago • 9 comments

Search before asking

  • [x] I had searched in the issues and found no similar feature requirement.

Description

When the image of a worker group is updated in the RayCluster definition, KubeRay should restart that worker group or perform a rolling update.

Use case

When the Docker image of a worker group is updated, the group currently continues to run with the old image until the cluster is restarted or re-created. Right now I just re-create the cluster after each image version update.

Related issues

#2534

Are you willing to submit a PR?

  • [x] Yes I am willing to submit a PR!

pkit avatar Jul 30 '25 00:07 pkit

We also experimented with scaling to zero and back, but that may lead to version discrepancies between worker groups. So in our case it's safer to just kill everything.

pkit avatar Jul 30 '25 00:07 pkit

@pkit what's your expectation for the head pod? Do you ever expect KubeRay to recreate it if you update the image even if it means losing the GCS state?

andrewsykim avatar Jul 30 '25 01:07 andrewsykim

I don't care much about the head pod. I manage all actors from the application anyway, so if it restarts, it's OK. Obviously it would be nice if it re-created the actors and was fault tolerant, but that's just a "nice to have".

pkit avatar Jul 30 '25 01:07 pkit

Can you try:

  1. Set the worker group's suspend field to true in the RayCluster CR
  2. Update the image
  3. Set the worker group's suspend field back to false in the RayCluster CR
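
The three steps above can be sketched as a sequence of JSON Patch (RFC 6902) documents. This is only an illustration, not something confirmed in this thread: it assumes the field lives at spec.workerGroupSpecs[i].suspend, that the Ray container is the first container in the worker pod template, and the image name and helper function names are hypothetical.

```python
import json


def build_suspend_patch(group_index: int, suspend: bool) -> list:
    """JSON Patch that sets the (assumed) suspend field of one worker group."""
    return [{
        # "add" on an object member creates the field or overwrites it.
        "op": "add",
        "path": f"/spec/workerGroupSpecs/{group_index}/suspend",
        "value": suspend,
    }]


def build_image_patch(group_index: int, image: str, container_index: int = 0) -> list:
    """JSON Patch that replaces the image of one container in a worker group."""
    return [{
        "op": "replace",
        # Assumption: the Ray container is at container_index (default 0).
        "path": (
            f"/spec/workerGroupSpecs/{group_index}"
            f"/template/spec/containers/{container_index}/image"
        ),
        "value": image,
    }]


def build_upgrade_steps(group_index: int, image: str) -> list:
    """Return the patches in order: suspend, update image, resume."""
    return [
        build_suspend_patch(group_index, True),
        build_image_patch(group_index, image),
        build_suspend_patch(group_index, False),
    ]


if __name__ == "__main__":
    # Hypothetical image name, used only for illustration.
    for step in build_upgrade_steps(0, "example.com/my-ray-worker:v2"):
        print(json.dumps(step))
```

Each printed document could then be applied in turn with kubectl patch using --type=json against the RayCluster resource. You would probably want to wait for the group's old pods to terminate before resuming it.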

Future-Outlier avatar Oct 11 '25 01:10 Future-Outlier

add a doc

Future-Outlier avatar Oct 11 '25 01:10 Future-Outlier

@pkit: @win5923 is working on this in #4185.

kevin85421 avatar Dec 03 '25 21:12 kevin85421

@kevin85421 looks good!

pkit avatar Dec 03 '25 21:12 pkit

@pkit I want to make sure you understand the limitation of #4185. To upgrade the Ray version, the Ray head Pod must be recreated, and all running jobs in the cluster will fail. Users need to drain the cluster before upgrading. Is this the behavior you are looking for?

kevin85421 avatar Dec 03 '25 22:12 kevin85421

@pkit yes, I'm aware of that limitation. We can re-create the cluster on a Ray upgrade, but that happens much less frequently than image updates.

pkit avatar Dec 03 '25 22:12 pkit