Allow configuration of the MutatingWebhook failure policy

sidewinder12s opened this issue 3 years ago

Describe the bug

I ran into issues with TLS certs being regenerated due to these bugs:

https://github.com/kubernetes-sigs/aws-load-balancer-controller/issues/2312
https://github.com/kubernetes-sigs/aws-load-balancer-controller/pull/2264

Once the TLS certs changed, the MutatingWebhook for PodReadinessGate started failing and blocking the rollout of pods on services using this feature.

This was the error:

Error creating: Internal error occurred: failed calling webhook "mpod.elbv2.k8s.aws": Post "https://aws-lb-controller-webhook-service.kube-system.svc:443/mutate-v1-pod?timeout=10s": x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "aws-load-balancer-controller-ca")

I think this exposes an availability concern: if all the pods backing a service get rescheduled while the MutatingWebhook is broken, the service will go down. My understanding is that the PodReadinessGate is a bonus feature that makes rollouts smoother in Kubernetes, so I think it would be preferable for the feature to simply not work rather than block rollouts altogether.
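For anyone unfamiliar with the mechanism, the readiness gate in question is injected by the mpod.elbv2.k8s.aws webhook at pod admission time; roughly, a pod created in an opted-in namespace ends up looking like the sketch below. The conditionType follows the pattern described in the controller docs, and all names are made up, not taken from this issue:

    # Illustrative only: a pod after the controller's mutating webhook has injected
    # a readiness gate tied to ALB/NLB target health. Pod name, TargetGroupBinding
    # name, and image are hypothetical.
    apiVersion: v1
    kind: Pod
    metadata:
      name: my-app-7d4f8b9c5d-abcde
    spec:
      readinessGates:
        - conditionType: target-health.elbv2.k8s.aws/k8s-myapp-example-tgb   # assumed TGB name
      containers:
        - name: app
          image: my-app:latest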

Steps to reproduce

Break TLS certs on the LB controller while using PodReadinessGates, then reschedule pods backing an LB in that namespace.

Expected outcome

I'd like to be able to either configure the webhook's failure policy or set it to fail open.
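For reference, the failure policy is a standard field on the Kubernetes webhook configuration object itself, so "fail open" here means failurePolicy: Ignore instead of Fail. A minimal sketch of what that would look like for the pod webhook follows; the object name, service name, and selector are assumptions based on the error message above and the controller docs, not the chart's actual manifest:

    # A minimal sketch, not the shipped manifest: the pod mutating webhook with
    # failurePolicy set to Ignore, so the API server still admits pods (without the
    # readiness gate) when the webhook endpoint is unreachable or its TLS is broken.
    # Object name, service name, and selector below are assumptions; caBundle omitted.
    apiVersion: admissionregistration.k8s.io/v1
    kind: MutatingWebhookConfiguration
    metadata:
      name: aws-load-balancer-webhook        # assumed; check kubectl get mutatingwebhookconfigurations
    webhooks:
      - name: mpod.elbv2.k8s.aws
        failurePolicy: Ignore                 # Fail (the Kubernetes default) is what blocks rollouts in this issue
        clientConfig:
          service:
            name: aws-load-balancer-webhook-service
            namespace: kube-system
            path: /mutate-v1-pod
        namespaceSelector:
          matchLabels:
            elbv2.k8s.aws/pod-readiness-gate-inject: enabled   # documented per-namespace opt-in label
        rules:
          - apiGroups: [""]
            apiVersions: ["v1"]
            operations: ["CREATE"]
            resources: ["pods"]
        admissionReviewVersions: ["v1"]
        sideEffects: None

The trade-off with Ignore is that pods admitted while the webhook is unreachable simply don't get the readiness gate, so rollouts in that window lose the graceful target-registration behavior instead of being blocked.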

Environment

  • AWS Load Balancer controller version: 2.4.1
  • Kubernetes version: 1.21
  • Using EKS (yes/no), if so version? Yes, platform version 7

Additional Context:

sidewinder12s, Jul 01 '22

Thanks for requesting this feature. We can add an option to specify it.

/kind good-first-issue

M00nF1sh, Jul 14 '22

@M00nF1sh: The label(s) kind/good-first-issue cannot be applied, because the repository doesn't have them.

In response to this:

Thanks for requesting this feature. We can add an option to specify it.

/kind good-first-issue

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot, Jul 14 '22

/assign

fabianberisha, Sep 05 '22

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot, Feb 08 '23

This will also be useful when the TargetGroupBinding admission webhook fails, for example with this error message during a Helm upgrade:

Error: UPGRADE FAILED: failed to create resource: admission webhook "vtargetgroupbinding.elbv2.k8s.aws" denied the request: TargetGroup arn:aws:elasticloadbalancing:xxxxx:123456789:targetgroup/my-custom-name-of-target-group/8u738u4iojd23 is already bound to TargetGroupBinding prod/php-9fba15d9
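For context, the resource the vtargetgroupbinding.elbv2.k8s.aws webhook validates looks roughly like the sketch below; all names and the ARN are placeholders. The error above is the webhook denying a binding whose target group is already claimed by the existing TargetGroupBinding prod/php-9fba15d9:

    # Sketch of a TargetGroupBinding; the validating webhook rejects a new binding
    # whose targetGroupARN is already bound by another TargetGroupBinding object.
    # All names and the ARN are placeholders.
    apiVersion: elbv2.k8s.aws/v1beta1
    kind: TargetGroupBinding
    metadata:
      name: php-example            # hypothetical
      namespace: prod
    spec:
      serviceRef:
        name: php                  # hypothetical Service name
        port: 80
      targetGroupARN: arn:aws:elasticloadbalancing:<region>:<account>:targetgroup/my-custom-name-of-target-group/<id>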

reixd, Nov 14 '23

It looks like this configuration option would be needed in the event of an availability zone failure when running a multi-AZ EKS cluster.

We ran into a similar issue when simulating a network AZ outage in our environment. We were surprised to see that even after all the nodes had failed over to healthy availability zones, no new pods could start. Investigating further, we saw errors about the load balancer controller's mutating webhook, which for some reason stops working during an AZ failure.

replicaset-controller Error creating: Internal error occurred: failed calling webhook "mpod.elbv2.k8s.aws": failed to call webhook: Post "https://aws-load-balancer-webhook-service.kube-system.svc:443/mutate-v1-pod?timeout=10s": context deadline exceeded

All the pods in namespaces with PodReadinessGates enabled were stuck, and the ReplicaSet controller was not able to create new pods. To work around it, we now need human intervention and a procedure in place where we disable the PodReadinessGates in the event of an AZ failure, in order to recover the cluster.
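A sketch of what that manual workaround amounts to, assuming the documented per-namespace opt-in label for readiness gate injection (the namespace name is hypothetical):

    # Sketch of the manual workaround: readiness gate injection is opted into per
    # namespace via this label, so removing it (or setting it to anything other than
    # "enabled") stops the webhook's selector from matching new pods in the namespace.
    apiVersion: v1
    kind: Namespace
    metadata:
      name: my-app
      labels:
        elbv2.k8s.aws/pod-readiness-gate-inject: disabled   # was "enabled"; flip or remove during the incident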

@M00nF1sh Could you confirm whether this feature would help in our scenario, or whether we should open a new issue for this?

juozasget, Nov 21 '23

/assign

josh-ferrell, Feb 07 '24

Closing as it appears this was addressed in #3653

josh-ferrell, Apr 30 '24

/close

josh-ferrell, Apr 30 '24

@josh-ferrell: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot, Apr 30 '24