[Feature] rolling upgrade design and implementation for Kuberay
Search before asking
- [X] I had searched in the issues and found no similar feature requirement.
Description
Right now we don't support Ray cluster rolling upgrades. This is a valid requirement for customers that have a large number of nodes in their Ray cluster deployments.
Use case
Support rolling upgrades of Ray clusters, which would benefit users with large Ray clusters.
Related issues
No response
Are you willing to submit a PR?
- [X] Yes I am willing to submit a PR!
This would be great to have. Let's figure out a design... Would we be aiming for something similar to rollouts for Deployments?
Note: we need to review #231 and make a new design.
> Would we be aiming for something similar to rollouts for Deployments?
@DmitriGekhtman I think we can separate the update into two roles:
- For the head node, we can just delete the old one and bring up a new one; here we need to consider how it interacts with the HA mechanism.
- For worker nodes, yes, we can use rolling-update logic similar to Deployments. However, there are some differences: a Deployment does not support scaling the old-version replicas to 0, because it keeps a `MaxUnavailable` or `MaxSurge`, whereas in Ray we can have only a head node with no workers, so we need to define some new behaviors (see the sketch after this comment).
Wilson and I will come up with a detailed design, and we may have several rounds of discussion.
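To make the head/worker split concrete, here is a minimal API sketch of what an opt-in upgrade strategy could look like, mirroring the Deployment rolling-update parameters mentioned above. All type and field names (`RayClusterUpgradeStrategy`, `headUpgrade`, `maxUnavailable`, `maxSurge`) are hypothetical and not part of the current KubeRay CRD.

```go
// Hypothetical API sketch only -- these types are NOT in the current KubeRay CRD;
// they illustrate one way the head/worker split could be expressed.
package v1alpha1

import (
	"k8s.io/apimachinery/pkg/util/intstr"
)

// HeadUpgradeType enumerates how the single head pod could be replaced.
type HeadUpgradeType string

const (
	// Recreate: delete the old head, then create the new one (needs to interact with GCS HA).
	HeadUpgradeRecreate HeadUpgradeType = "Recreate"
)

// RayClusterUpgradeStrategy would be an optional field on RayClusterSpec.
type RayClusterUpgradeStrategy struct {
	// The head pod can only be recreated, since there is exactly one of it.
	HeadUpgrade HeadUpgradeType `json:"headUpgrade,omitempty"`

	// Worker groups could reuse Deployment-style rolling parameters.
	// Unlike a Deployment, MaxUnavailable may need to allow scaling a worker
	// group all the way to zero, because a head-only Ray cluster is valid.
	MaxUnavailable *intstr.IntOrString `json:"maxUnavailable,omitempty"`
	MaxSurge       *intstr.IntOrString `json:"maxSurge,omitempty"`
}
```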
I like the strategy of splitting the discussion (and potentially even implementation) into updates for head and updates for worker.
cc @brucez-anyscale for the head node HA aspect. Stating the question again: What should happen when you change the configuration for a RayCluster's head pod?
> Wilson and I will come up with a detailed design, and we may have several rounds of discussion.
That's great! I'm looking forward to discussing the design of this functionality -- I think it's very important.
Right now, RayService handles upgrades at the whole-cluster level, so RayService works on its own for now. For RayCluster rolling upgrades: I think the head node and worker nodes should be backward compatible, so they can rejoin the Ray cluster.
@wilsonwang371 Here I think we need to find the exact use cases where users can benefit from this feature.
First is the user behavior. Following the previous discussion, we can assume that in this story:
- the users would like to upgrade the worker image.
- the users would like to upgrade the head image.
- the users would like to upgrade both the head and worker images.
In all of those cases, we need to ensure that the Ray packages in the images are compatible.
Here are some scenarios that I can think of:
- There is no actor or task running on the `raycluster`. In this case, we would not need the feature, since the recreate strategy would be enough; the only modification is to enable the worker upgrade in the reconcile loop.
- There are some jobs running on the `raycluster`, with some remaining actors inside the old one. Here the situation is a little tricky, since we would need mechanisms in Ray to migrate actors from the old `raycluster` to the new one.
- There is a Ray service running on the `raycluster`; just as @brucez-anyscale said, the whole cluster would be upgraded. This case is the most likely to need the rolling upgrade feature: since for now we may recreate a brand-new `raycluster` via the `rayservice` controller, we may support rolling upgrade in the `raycluster` controller to ease the Ray service upgrade.
Indeed, we need to support standard update semantics for `raycluster`, at least with a recreate strategy. However, for now, considering those cases, would a `raycluster` rolling upgrade feature bring any further significant benefit to the user? WDYT @DmitriGekhtman
Let's first consider the most basic use-case that we were going for with the --forced-cluster-upgrade flag.
When a user updates a RayCluster CR and applies it, they expect changes to pod configs to be reflected in the actual pod configuration, even if the change is potentially disruptive to the Ray workload. If you update a workerGroupSpec, workers with outdated configuration should be eliminated and workers with updated configuration should be created. Same thing for the HeadGroupSpec.
The ability to do (destructive) updates is available with the Ray Autoscaler's VM node providers and with the legacy python-based Ray operator. The implementation for this uses hashes of last-applied node configuration. We could potentially do the same thing here.
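As a rough illustration of that hash-based approach, a reconciler could annotate each pod with a hash of the pod template it was created from and treat any mismatch as an outdated pod. This is only a sketch under that assumption; the annotation key and helper names below are hypothetical, not KubeRay's actual implementation.

```go
// Sketch only: registering pod-config updates via a hash annotation, similar in
// spirit to the autoscaler's launch-config hashing mentioned above.
package controllers

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"

	corev1 "k8s.io/api/core/v1"
)

// Hypothetical annotation key used to record the hash a pod was created from.
const podTemplateHashAnnotation = "ray.io/pod-template-hash"

// hashPodTemplate hashes the desired pod spec taken from the CR's group spec.
func hashPodTemplate(tmpl corev1.PodTemplateSpec) (string, error) {
	raw, err := json.Marshal(tmpl.Spec)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(raw)
	return hex.EncodeToString(sum[:]), nil
}

// podIsOutdated compares a live pod's recorded hash with the desired one;
// outdated pods would be deleted and recreated by the reconciler.
func podIsOutdated(pod corev1.Pod, desiredHash string) bool {
	return pod.Annotations[podTemplateHashAnnotation] != desiredHash
}
```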
If Ray versions mismatch, things won't work out, no matter what, because Ray does not have cross-version compatibility. If workloads are running, they may be interrupted. These are complex, higher-order concerns, but we can start by just registering pod config updates.
Question: why create pods directly instead of a Deployment, which would handle this? (Side note: I'm not familiar with Ray in particular; I'm just an operator of a Kubernetes cluster where Ray is deployed.)
I am curious if there has been any update on this feature, or whether there are any plans?
If we are worried that we do not have a strong use case to focus on, I can help. Not having rolling upgrades is a real pain for us. I am speaking from the perspective of an ML platform that supports all ML teams within a company.
- We plan to have several Ray clusters, standing and ephemeral. Think one Ray cluster each for model dev (interactive), automated training, batch serving, and real-time serving per group or project in one ML team.
- For standing clusters, not having rolling upgrades sets our infrastructure back by a few years. Every service we run has rolling upgrades, and we do not allow downtime in production services.
- For real-time serving (Ray Serve) this is a blocker. Serving needs to be available 24x7; there is no acceptable downtime outside of the SLA.
- Since the project-specific Python dependencies are baked into the image running on the workers, we will need to update the image for every change. This happens frequently for us, and scheduling downtime to do it is out of the norm for our infrastructure.
- Since KubeRay is at v0.5.0, we expect to keep up with its rapid version upgrades, and this will require us to delete and recreate all our Ray clusters.
- Deleting and recreating a resource is not a standard CI/CD operation for us; it requires custom steps or manual support. Deleting a resource manually is reserved for emergencies, but the lack of rolling upgrades forces us to do it frequently.
I am happy to discuss this further, or help in any way I can.
@jhasm I don't want to speak for others, but I believe Serve will be critical to ensuring 100% uptime during upgrades of Ray cluster versions. The way a model is served (i.e., Serve CLI, SDK, etc.) shouldn't hinder the upgrade. I had some thoughts I wanted to share.
There may be opportunities to enable cluster version rolling upgrades using Ray's GCS external Redis.
A potential starting point may be to detect when the Ray cluster version changes. If the version changes and the cluster name is currently deployed, then launch a new Ray cluster. Once jobs are transferred, have KubeRay rewrite the service to point to the new cluster. I believe the more complex portion is transferring the jobs and actors to the new cluster.
Good point: https://ray-distributed.slack.com/archives/C02GFQ82JPM/p1682018043124949?thread_ts=1681846159.725999&cid=C02GFQ82JPM
Keep the head service and serve service with the same name.
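For the "rewrite the service" step, a hedged sketch of repointing the head Service at the new cluster could look like the following. The label keys here (`ray.io/cluster`, `ray.io/node-type`) and the drain precondition are assumptions for illustration, not a description of what KubeRay does today.

```go
// Illustrative only: switch an existing head/serve Service to select the new
// cluster's head pod, keeping the Service name stable as suggested above.
package controllers

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// switchHeadService repoints an existing Service at the new cluster's head pod.
// A real controller would only call this after jobs/actors have been migrated,
// which is the hard part called out above.
func switchHeadService(ctx context.Context, c client.Client, svc *corev1.Service, newClusterName string) error {
	svc.Spec.Selector = map[string]string{
		"ray.io/cluster":   newClusterName, // assumed label keys
		"ray.io/node-type": "head",
	}
	return c.Update(ctx, svc)
}
```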
Any update on this? Lack of rolling updates is a no-go for many production serving workloads.
The RayService custom resource is intended to support the upgrade semantics of the sort people in this thread are looking for.
An individual Ray cluster should be thought of as a massive pod -- there is not a coherent way to conduct a rolling upgrade of a single Ray cluster (though some large enterprises have actually managed to achieve this).
tl;dr solutions for upgrades require multiple Ray clusters
In my experience, doing anything "production-grade" with Ray requires multiple Ray clusters and external orchestration.
@qizzzh, I just saw your message. As @DmitriGekhtman mentioned, upgrading Ray involves more than one RayCluster. For RayService, we plan to support incremental upgrades, meaning that we won't need a new, large RayCluster for a zero-downtime upgrade. Instead, we will gradually increase the size of the new RayCluster and decrease the size of the old one. If you want to chat more, feel free to reach out to me on Slack.
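Purely as an illustration of that incremental idea, the replica shift could be computed step by step on each reconcile, something like the sketch below. The function name and step logic are hypothetical; the real RayService logic would also need to handle readiness, traffic shifting, and rollback.

```go
// Illustrative sketch: move worker replicas from the old RayCluster to the new
// one in fixed-size steps until the new cluster owns the full target count.
package controllers

// stepReplicas returns the next (old, new) worker replica counts, shifting up
// to `step` replicas per reconcile pass.
func stepReplicas(oldReplicas, newReplicas, target, step int32) (oldNext, newNext int32) {
	if newReplicas >= target {
		return 0, target // new cluster fully scaled; old cluster can be drained
	}
	grow := step
	if target-newReplicas < step {
		grow = target - newReplicas
	}
	shrink := grow
	if oldReplicas < shrink {
		shrink = oldReplicas
	}
	return oldReplicas - shrink, newReplicas + grow
}
```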
Ray doesn't natively support rolling upgrades, so it is impossible for KubeRay to achieve this within a single RayCluster. This issue should move to Ray instead of KubeRay. Closing this issue; I will open new issues to track incremental upgrade when I start working on it.