[Feature] Make Replicas field in WorkerGroupSpec optional
Search before asking
- [X] I had searched in the issues and found no similar feature requirement.
Description
The Replicas field in WorkerGroupSpec is currently required. This field should be optional, since it is a dynamic property that can be updated by the autoscaler.
When we update other static properties of a Ray cluster (e.g. MaxReplicas), we have to make sure the Replicas field is in sync with the value currently in use, so that our deployment process doesn't accidentally change the state of a Ray cluster running with a different replica count.
Use case
If Replicas is optional, then when I update a Ray cluster with the autoscaler enabled, I don't need to make sure the Replicas field matches the value currently in use. I can simply omit this field and update other static fields without affecting the current state of the cluster.
Related issues
No response
Are you willing to submit a PR?
- [ ] Yes I am willing to submit a PR!
This is a little subtle. The Replicas field is a goal state that the operator tries to maintain. In the current architecture it has to be present in the CR so that the operator knows what to do.
The autoscaler works by editing the Replicas field.
On the other hand, it is awkward to have to specify a Replicas field on a cluster that is currently autoscaling.
One current workaround is to apply changes to the CR only by PATCHing it. Could you say more about your workflow for updating a RayCluster?
I think it'd be possible to work out appropriate defaulting behavior for replicas, but it might be a little tricky.
Hmm, I guess if the Replicas field were empty, the operator's reconcile logic could determine the correct Replicas to fill in based on:
- The current pod count
- MinReplicas
- MaxReplicas
If absent, set Replicas to the current pod count, unless the current pod count is outside the range [MinReplicas, MaxReplicas], in which case clamp Replicas to MinReplicas or MaxReplicas as appropriate.
This strategy would be strange in that it basically sets spec according to status, but it could work.
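The defaulting rule above can be sketched as a small Go function. This is a minimal illustration, not KubeRay's actual reconcile code; the function name and signature are hypothetical.

```go
package main

import "fmt"

// defaultReplicas sketches the proposed defaulting behavior: when Replicas
// is absent (nil), fall back to the observed pod count, clamped to the
// range [minReplicas, maxReplicas]. An explicit Replicas value always wins.
func defaultReplicas(replicas *int32, podCount, minReplicas, maxReplicas int32) int32 {
	if replicas != nil {
		return *replicas // explicit goal state set by the user or autoscaler
	}
	if podCount < minReplicas {
		return minReplicas // clamp up
	}
	if podCount > maxReplicas {
		return maxReplicas // clamp down
	}
	return podCount // keep the current state
}

func main() {
	fmt.Println(defaultReplicas(nil, 5, 1, 10))  // in range: 5
	fmt.Println(defaultReplicas(nil, 0, 1, 10))  // below min: 1
	fmt.Println(defaultReplicas(nil, 12, 1, 10)) // above max: 10
}
```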
> One current workaround is to apply changes to the CR only by PATCHing it. Could you say more about your workflow for updating a RayCluster?
We currently manage several Ray clusters for different teams in a declarative way. We give users permission to scale the clusters up and down in their own namespaces, so the clusters may run with a different number of replicas due to manual overrides or autoscaling. When we perform maintenance on managed Ray clusters and change their static properties (e.g. updating MaxReplicas, or adding new worker groups), we don't want to override the Replicas value currently in use. It would be nice if the proper replica count could be determined by the operator when the field is absent.
For now, the suggestion is to patch just the replica count.
@daikeshi can correct me if I'm wrong, but we've switched our user workflow to creating RayClusters on an as-needed basis via a CLI. So this is less of a concern for us now.