aws-load-balancer-controller
Allow configuration of the MutatingWebhook failure policy
Describe the bug
I ran into issues with TLS certs being regenerated due to these bugs:
- https://github.com/kubernetes-sigs/aws-load-balancer-controller/issues/2312
- https://github.com/kubernetes-sigs/aws-load-balancer-controller/pull/2264
Once the TLS certs changed, the MutatingWebhook for PodReadinessGate started failing and blocking the rollout of pods on services using this feature.
This was the error:
Error creating: Internal error occurred: failed calling webhook "mpod.elbv2.k8s.aws": Post "https://aws-lb-controller-webhook-service.kube-system.svc:443/mutate-v1-pod?timeout=10s": x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "aws-load-balancer-controller-ca")
I think this exposes an availability concern: if all the pods backing a service get rescheduled while the mutating webhook is broken, the service will go down. My understanding is that the PodReadinessGate is a bonus feature to make rollouts smoother in Kubernetes, and I think it'd be preferable for the feature to simply not work rather than block rollouts altogether.
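For context, what the webhook normally does is inject a readiness gate into pods in opted-in namespaces, roughly like the sketch below (the conditionType is illustrative; the real value is derived from the TargetGroupBinding name):

```yaml
# Rough sketch of the mutation performed by mpod.elbv2.k8s.aws on pods in
# namespaces that opt in to readiness gate injection. The conditionType shown
# is illustrative; the actual value is derived from the TargetGroupBinding.
spec:
  readinessGates:
    - conditionType: target-health.elbv2.k8s.aws/my-targetgroupbinding
```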
Steps to reproduce
Break TLS certs on the LB controller while using PodReadinessGates, then reschedule pods backing an LB in that namespace.
Expected outcome
I'd like to either be able to configure the webhook's failure policy or have it set to fail open.
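For reference, failing open here would mean setting failurePolicy: Ignore on the webhook, so the API server admits pods without the mutation when the webhook is unreachable. A minimal excerpt of the relevant MutatingWebhookConfiguration (the object name may differ between installs, and required fields such as clientConfig are omitted):

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: aws-load-balancer-webhook   # name may differ depending on the install
webhooks:
  - name: mpod.elbv2.k8s.aws
    # "Ignore" = fail open: if the webhook call errors or times out, pods are
    # admitted without the readiness gate instead of being rejected.
    # "Fail" (the behaviour described in this issue) blocks pod creation.
    failurePolicy: Ignore
```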
Environment
- AWS Load Balancer controller version: 2.4.1
- Kubernetes version: 1.21
- Using EKS (yes/no), if so version? Yes, platform version 7
Additional Context:
Thanks for requesting this feature. We can add an option to specify it.
/kind good-first-issue
@M00nF1sh: The label(s) kind/good-first-issue cannot be applied, because the repository doesn't have them.
In response to this:
Thanks for requesting this feature. We can add an option to specify it.
/kind good-first-issue
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/assign
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.
This bot triages PRs according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the PR is closed
You can:
- Mark this PR as fresh with `/remove-lifecycle stale`
- Close this PR with `/close`
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
This will also be useful when the TargetGroupBinding admission webhook fails, for example with this error message during a helm upgrade:
Error: UPGRADE FAILED: failed to create resource: admission webhook "vtargetgroupbinding.elbv2.k8s.aws" denied the request: TargetGroup arn:aws:elasticloadbalancing:xxxxx:123456789:targetgroup/my-custom-name-of-target-group/8u738u4iojd23 is already bound to TargetGroupBinding prod/php-9fba15d9
It looks like this configuration option would also be needed in the event of an availability zone failure when running a multi-AZ EKS cluster.
We ran into a similar issue when simulating a network AZ outage in our environment. We were surprised to see that even after all the nodes had failed over to healthy availability zones, no new pods could start. Investigating further, we saw errors about the load balancer controller's mutating webhook, which for some reason stops working during an AZ failure.
replicaset-controller Error creating: Internal error occurred: failed calling webhook "mpod.elbv2.k8s.aws": failed to call webhook: Post "https://aws-load-balancer-webhook-service.kube-system.svc:443/mutate-v1-pod?timeout=10s": context deadline exceeded
All the pods in namespaces with the PodReadinessGates enabled were stuck, and the replica set controller was not able to create new pods. To work around it, we now need human intervention and a procedure in place where we disable the PodReadinessGates in the event of an AZ failure to recover the cluster (sketched below).
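Concretely, the recovery procedure amounts to removing the per-namespace opt-in label so the API server stops calling the webhook for new pods in that namespace; a sketch with an illustrative namespace name:

```yaml
# Readiness gate injection is opted into per namespace via this label.
# Removing it (e.g. `kubectl label namespace prod elbv2.k8s.aws/pod-readiness-gate-inject-`)
# takes the namespace out of the webhook's namespaceSelector, so new pods no
# longer depend on the webhook being reachable.
apiVersion: v1
kind: Namespace
metadata:
  name: prod   # illustrative namespace name
  labels:
    elbv2.k8s.aws/pod-readiness-gate-inject: enabled
```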
@M00nF1sh Maybe you could confirm if this feature will help in our scenario or we should open a new issue for this?
/assign
Closing as it appears this was addressed in #3653
/close
@josh-ferrell: Closing this issue.
In response to this:
/close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.