[Feature] Enable replicated KubeRay operator deployments
Search before asking
- [X] I had searched in the issues and found no similar feature requirement.
Description
To improve operator availability, it should be possible to run operator deployments with more than one replica, using leader election. This pattern is not currently supported; see the discussion in https://ray-distributed.slack.com/archives/C02GFQ82JPM/p1660585607010159
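For illustration, a minimal sketch of what a replicated operator Deployment could look like once this feature exists. The `--enable-leader-election` flag is an assumption about how the feature might be exposed, not an existing KubeRay operator flag; names, namespace, and image tag are placeholders.

```yaml
# Hypothetical sketch: KubeRay operator Deployment with two replicas.
# Assumes a future --enable-leader-election flag; only the elected leader
# would actively reconcile, the other replica would stand by.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kuberay-operator
  namespace: ray-system
spec:
  replicas: 2
  selector:
    matchLabels:
      app.kubernetes.io/name: kuberay-operator
  template:
    metadata:
      labels:
        app.kubernetes.io/name: kuberay-operator
    spec:
      containers:
        - name: kuberay-operator
          image: kuberay/operator:latest   # placeholder tag
          args:
            - --enable-leader-election     # assumed flag, see lead-in above
```

Only one replica would reconcile at a time; the other would simply hold a standby position until the leader's lease lapses, which is the trade-off discussed in the comments below.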
Use case
Improved availability for the KubeRay operator.
Related issues
No response
Are you willing to submit a PR?
- [ ] Yes I am willing to submit a PR!
One thing to note here is that multiple KubeRay instances do not improve performance; on the contrary, they might slow it down (e.g. due to leader election failures).
Since at any point in time we should have only one ACTIVE instance (for consistency), the value of adding more replicas is not that evident.
A good example is the K8s scheduler: there is only one instance of it in the cluster, and it can handle a lot of load.
The leader election pattern is particularly useful if the KubeRay startup time is really long, e.g. 1 or 2 minutes. In that case, the potential advantage of running multiple instances with leader election is faster recovery time. But since the KubeRay startup time after a failure is now very small (a few seconds), a standby replica is not really needed, and I think leader election might do more harm than good at this point.
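For context on what failover would actually look like: Kubernetes-style leader election is commonly backed by a coordination.k8s.io Lease object, and a standby takes over roughly when the lease duration expires after the leader stops renewing it. A rough sketch of such a Lease is below; the object name, namespace, and timing values are illustrative assumptions, not KubeRay's actual configuration.

```yaml
# Illustrative Lease object used by Kubernetes leader election
# (coordination.k8s.io/v1). Name and namespace are placeholders.
apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  name: kuberay-operator-leader       # would match the leader election ID
  namespace: ray-system
spec:
  holderIdentity: kuberay-operator-7c9d9f7b9d-abcde   # current leader pod
  leaseDurationSeconds: 15   # standby waits roughly this long before taking over
  renewTime: "2022-08-15T17:46:47.000000Z"
  acquireTime: "2022-08-15T17:40:00.000000Z"
  leaseTransitions: 1
```

So with a standby replica, recovery is bounded by the lease settings (seconds), whereas without one it is bounded by rescheduling plus operator startup, which, as noted above, is also only a few seconds today.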
The KubeRay operator image is very small so it can quickly be pulled on another k8s node in case of machine failure.
I suppose you could have a real issue if the operator's host K8s node goes down, there is no additional capacity in your K8s cluster, and your cluster autoscaler has to provision a new cloud instance. Probably heading into the realm of edge-cases here.
@DmitriGekhtman can we close this issue? Thanks!
I've created a P3 label just for this issue :) Anyscale has learned that there are some natural reasons for wanting leader election -- in particular, it can protect against degraded hardware.
> I suppose you could have a real issue if the operator's host K8s node goes down, there is no additional capacity in your K8s cluster, and your cluster autoscaler has to provision a new cloud instance. Probably heading into the realm of edge-cases here.
This can be handled by setting priorityClassName to, for example, 'system-cluster-critical', which is set up by default in k8s clusters and should be used only for operators. This way the KubeRay operator can evict other pods to ensure it is scheduled as fast as possible, while the evicted pods wait for a new node.
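For illustration, a sketch of what that looks like in the operator's Deployment. `priorityClassName` is a standard pod spec field and `system-cluster-critical` is one of the built-in priority classes; the names and image tag below are placeholders.

```yaml
# Sketch: run the operator under a built-in high-priority class so the
# scheduler can preempt lower-priority pods to place it quickly after
# a node failure.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kuberay-operator
  namespace: ray-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: kuberay-operator
  template:
    metadata:
      labels:
        app.kubernetes.io/name: kuberay-operator
    spec:
      priorityClassName: system-cluster-critical
      containers:
        - name: kuberay-operator
          image: kuberay/operator:latest   # placeholder tag
```

Note that preemption evicts lower-priority pods to make room, so whether this trade-off is acceptable depends on the other workloads sharing the cluster.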