[BUG] rancher-webhook does not run with high availability
Rancher Server Setup
- Rancher version: 2.7.5
- Installation option (Docker install/Helm Chart):
- If Helm Chart, Kubernetes Cluster and version (RKE1, RKE2, k3s, EKS, etc):
- Proxy/Cert Details:
Information about the Cluster
- Kubernetes version: v1.27.6+rke2r1
- Cluster Type (Local/Downstream): RKE2
- If downstream, what type of cluster? (Custom/Imported or specify provider for Hosted/Infrastructure Provider):
User Information
- What is the role of the user logged in? (Admin/Cluster Owner/Cluster Member/Project Owner/Project Member/Custom)
- If custom, define the set of permissions:
Describe the bug
The rancher-webhook deployment only runs a single replica. To ensure high availability of the component, it should run with multiple replicas spread across nodes and zones using pod anti-affinity and topology spread constraints.
In cases where the pod cannot be scheduled, or the availability zone (e.g., on AWS) hosting the pod is having issues, the overall availability of Rancher is impacted.
To Reproduce
Result
rancher-webhook currently runs as a single replica.
Expected Result
rancher-webhook deployment should be running with multiple replicas, pod anti-affinity, and topology spread constraints. Preferably all configurable via the primary rancher helm chart.
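For concreteness, a hedged sketch of what the requested spec could look like (the namespace, labels, replica count, and topology keys below are assumptions for illustration, not the actual chart output):

```yaml
# Hypothetical values throughout -- not the actual rancher-webhook chart.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: rancher-webhook
  namespace: cattle-system
spec:
  replicas: 3                          # ideally configurable via the rancher chart
  selector:
    matchLabels:
      app: rancher-webhook
  template:
    metadata:
      labels:
        app: rancher-webhook
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                topologyKey: kubernetes.io/hostname
                labelSelector:
                  matchLabels:
                    app: rancher-webhook
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: rancher-webhook
      containers:
        - name: rancher-webhook
          image: rancher/rancher-webhook   # tag omitted
```

Preferred (soft) anti-affinity and `whenUnsatisfiable: ScheduleAnyway` keep the deployment schedulable on small clusters while still spreading replicas when capacity allows.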
Screenshots
Additional context
+1 I'm really surprised there hasn't been more of an uproar about this. Any time the (single) rancher-webhook Pod gets evicted due to a Spot instance being reclaimed or something similar, the whole cluster goes haywire, because Rancher configures the rancher-webhook admission webhook with failurePolicy: Fail: whenever rancher-webhook is unavailable, the kube-apiserver rejects every request matched by the webhook's rules.
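To illustrate the failure mode, this is roughly the field in question (the names below are placeholders, not the actual configuration Rancher installs):

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: rancher.cattle.io        # placeholder name
webhooks:
  - name: rancher.cattle.io      # placeholder name
    failurePolicy: Fail          # webhook down => matched requests are rejected
    # With failurePolicy: Ignore, a webhook outage would instead let matched
    # requests through unvalidated -- a different trade-off, which is why
    # running multiple replicas is the better fix.
    # (clientConfig, rules, and other required fields omitted)
```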
Related: https://github.com/rancher/webhook/issues/365
The solution here is not as trivial as just increasing the replica count.
The webhook is currently responsible for managing its own ValidatingWebhookConfigurations and MutatingWebhookConfigurations; that behavior needs to be verified when the webhook runs in HA.
Similarly, the webhook uses dynamiclistener for its TLS certs: it generates them, creates Secret resources in the cluster, and adds the CA cert to the caBundle field of the *WebhookConfigurations to tell the kube-apiserver what to trust. If we simply increase the number of replicas, the behavior is unclear: will they compete over which CA cert and TLS certs are the correct ones?
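For reference, the trust wiring being described lives in the clientConfig of each configuration (service name and namespace below are assumptions):

```yaml
webhooks:
  - name: rancher.cattle.io      # placeholder name
    clientConfig:
      service:
        name: rancher-webhook
        namespace: cattle-system
        port: 443
      # dynamiclistener writes the generated CA here; if multiple replicas
      # each generate certs independently, this single field becomes contested.
      caBundle: <base64-encoded CA certificate>
```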
I think the answer is pretty clear to me -- like basically every other Kubernetes component with this problem, implement leader election. When a new Pod starts up, it refuses to accept connections until it has verified that it is either the leader (in which case it is solely responsible for creating shared resources like the TLS artifacts) or a follower (in which case it leaves the shared resources alone and just starts accepting connections).
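The gating described above can be sketched abstractly. This toy program is not the actual webhook code: the in-memory `lease` is a stand-in for the coordination.k8s.io Lease that k8s.io/client-go/tools/leaderelection would manage in a real cluster. It shows each replica resolving its role before it would report ready, with exactly one winner:

```go
package main

import (
	"fmt"
	"sync"
)

// lease is a toy stand-in for a Kubernetes Lease object. In-cluster, the
// compare-and-swap below would be an optimistic update via the API server.
type lease struct {
	mu     sync.Mutex
	holder string
}

// tryAcquire atomically claims the lease if it is unheld, or confirms that
// the caller already holds it.
func (l *lease) tryAcquire(id string) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.holder == "" {
		l.holder = id
		return true
	}
	return l.holder == id
}

// replica models one webhook pod.
type replica struct {
	id       string
	isLeader bool
}

// resolveRole decides the pod's role before it starts accepting connections:
// the leader owns shared resources (TLS secrets, *WebhookConfigurations),
// followers only serve admission requests.
func (r *replica) resolveRole(l *lease) {
	r.isLeader = l.tryAcquire(r.id)
}

func main() {
	l := &lease{}
	pods := []*replica{{id: "webhook-0"}, {id: "webhook-1"}, {id: "webhook-2"}}

	var wg sync.WaitGroup
	for _, p := range pods {
		wg.Add(1)
		go func(p *replica) {
			defer wg.Done()
			p.resolveRole(l)
		}(p)
	}
	wg.Wait()

	leaders := 0
	for _, p := range pods {
		if p.isLeader {
			leaders++
		}
	}
	// Exactly one replica wins the election regardless of startup order.
	fmt.Println("leaders:", leaders)
}
```

In a real implementation, losing replicas would keep retrying acquisition (so a follower can take over if the leader dies) rather than deciding once at startup.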
@brudnak @vatsalparekh any ideas on what we are gonna do about this? imho this is actually a pretty big deal.
This should not be breaking down so easily.
SURE-8650
This will likely be split out into 2 or 3 separate issues. The first issue will cover making it possible for the webhook to run successfully in HA mode (multiple pods). The other issue(s) will cover upstream vs downstream webhook configuration(s).
We need to move this to a future milestone; it needs an RFD and a review cycle. There are significant layers to consider, with impact on other functionality.