[BUG] rancher-webhook does not run with high availability
Rancher Server Setup
- Rancher version: 2.7.5
- Installation option (Docker install/Helm Chart):
- If Helm Chart, Kubernetes Cluster and version (RKE1, RKE2, k3s, EKS, etc):
- Proxy/Cert Details:
Information about the Cluster
- Kubernetes version: v1.27.6+rke2r1
- Cluster Type (Local/Downstream): RKE2
- If downstream, what type of cluster? (Custom/Imported or specify provider for Hosted/Infrastructure Provider):
User Information
- What is the role of the user logged in? (Admin/Cluster Owner/Cluster Member/Project Owner/Project Member/Custom)
- If custom, define the set of permissions:
Describe the bug
The rancher-webhook deployment only runs a single replica. To ensure high availability of the component, it should run with multiple replicas spread across nodes and zones using pod anti-affinity and topology spread constraints.
In cases where the pod cannot be scheduled, or the availability zone (e.g., on AWS) hosting the pod is having issues, the overall availability of Rancher is impacted.
To Reproduce
Result
rancher-webhook currently runs as a single replica.
Expected Result
rancher-webhook deployment should be running with multiple replicas, pod anti-affinity, and topology spread constraints. Preferably all configurable via the primary rancher helm chart.
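For concreteness, a hedged sketch of what the requested spec could look like (the namespace, labels, replica count, and topology keys below are assumptions for illustration, not the actual chart output):

```yaml
# Hypothetical values throughout -- not the actual rancher-webhook chart.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: rancher-webhook
  namespace: cattle-system
spec:
  replicas: 3                          # ideally configurable via the rancher chart
  selector:
    matchLabels:
      app: rancher-webhook
  template:
    metadata:
      labels:
        app: rancher-webhook
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                topologyKey: kubernetes.io/hostname
                labelSelector:
                  matchLabels:
                    app: rancher-webhook
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: rancher-webhook
      containers:
        - name: rancher-webhook
          image: rancher/rancher-webhook   # tag omitted
```

Preferred (soft) anti-affinity and `whenUnsatisfiable: ScheduleAnyway` keep the deployment schedulable on small clusters while still spreading replicas when capacity allows.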
Screenshots
Additional context
+1 I'm really surprised there hasn't been more of an uproar about this. Any time the (single) rancher-webhook Pod gets evicted due to a Spot instance being reclaimed or something similar, the whole cluster goes haywire, because Rancher configures the rancher-webhook admission webhook with failurePolicy: Fail: whenever rancher-webhook is unavailable, the kube-apiserver rejects every request matched by the webhook's rules.
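To illustrate the failure mode, this is roughly the field in question (the names below are placeholders, not the actual configuration Rancher installs):

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: rancher.cattle.io        # placeholder name
webhooks:
  - name: rancher.cattle.io      # placeholder name
    failurePolicy: Fail          # webhook down => matched requests are rejected
    # With failurePolicy: Ignore, a webhook outage would instead let matched
    # requests through unvalidated -- a different trade-off, which is why
    # running multiple replicas is the better fix.
    # (clientConfig, rules, and other required fields omitted)
```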
Related: https://github.com/rancher/webhook/issues/365
The solution here is not as trivial as just increasing the replica count.
The webhook is currently responsible for managing its own ValidatingWebhookConfigurations and MutatingWebhookConfigurations; that behavior needs to be verified when the webhook runs in HA.
Similarly, the webhook uses dynamiclistener for its TLS certs: it generates them, creates Secret resources in the cluster, and adds the CA cert to the caBundle field of the *WebhookConfigurations to tell the kube-apiserver what to trust. If we simply increase the number of replicas, the behavior is unclear: will they compete over which CA cert and TLS certs are the correct ones?
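For reference, the trust wiring being described lives in the clientConfig of each configuration (service name and namespace below are assumptions):

```yaml
webhooks:
  - name: rancher.cattle.io      # placeholder name
    clientConfig:
      service:
        name: rancher-webhook
        namespace: cattle-system
        port: 443
      # dynamiclistener writes the generated CA here; if multiple replicas
      # each generate certs independently, this single field becomes contested.
      caBundle: <base64-encoded CA certificate>
```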
I think the answer is pretty clear to me -- like basically every other Kubernetes component with this problem, implement leader election. When a new Pod starts up, it refuses to accept connections until it has verified that it is either the leader (in which case it is solely responsible for creating shared resources like the TLS artifacts) or a follower (in which case it leaves the shared resources alone and just starts accepting connections).
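The gating described above can be sketched abstractly. This toy program is not the actual webhook code: the in-memory `lease` is a stand-in for the coordination.k8s.io Lease that k8s.io/client-go/tools/leaderelection would manage in a real cluster. It shows each replica resolving its role before it would report ready, with exactly one winner:

```go
package main

import (
	"fmt"
	"sync"
)

// lease is a toy stand-in for a Kubernetes Lease object. In-cluster, the
// compare-and-swap below would be an optimistic update via the API server.
type lease struct {
	mu     sync.Mutex
	holder string
}

// tryAcquire atomically claims the lease if it is unheld, or confirms that
// the caller already holds it.
func (l *lease) tryAcquire(id string) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.holder == "" {
		l.holder = id
		return true
	}
	return l.holder == id
}

// replica models one webhook pod.
type replica struct {
	id       string
	isLeader bool
}

// resolveRole decides the pod's role before it starts accepting connections:
// the leader owns shared resources (TLS secrets, *WebhookConfigurations),
// followers only serve admission requests.
func (r *replica) resolveRole(l *lease) {
	r.isLeader = l.tryAcquire(r.id)
}

func main() {
	l := &lease{}
	pods := []*replica{{id: "webhook-0"}, {id: "webhook-1"}, {id: "webhook-2"}}

	var wg sync.WaitGroup
	for _, p := range pods {
		wg.Add(1)
		go func(p *replica) {
			defer wg.Done()
			p.resolveRole(l)
		}(p)
	}
	wg.Wait()

	leaders := 0
	for _, p := range pods {
		if p.isLeader {
			leaders++
		}
	}
	// Exactly one replica wins the election regardless of startup order.
	fmt.Println("leaders:", leaders)
}
```

In a real implementation, losing replicas would keep retrying acquisition (so a follower can take over if the leader dies) rather than deciding once at startup.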
@brudnak @vatsalparekh any ideas on what we are gonna do about this? imho this is actually a pretty big deal.
This should not be breaking down so easily.
SURE-8650
This will likely be split out into 2 or 3 separate issues. The first issue will cover making it possible for the webhook to run successfully in HA mode (multiple pods). The other issue(s) will cover upstream vs downstream webhook configuration(s).
We need to move this to a future milestone; it needs an RFD and a review cycle. There are significant layers to consider, with impact on other functionality.