aws-load-balancer-controller

Ingress group stuck with one ingress certificate error

Open thanhma opened this issue 4 years ago • 16 comments

Is your feature request related to a problem?

I have been using an ingress group to combine about 30 ingresses into a single ALB. Each ingress has its own SSL certificate, imported into ACM.

When a certificate expires and, for some reason, we are unable to renew it, the ALB Controller starts failing to reconcile the whole ingress group, with logs like this:

{"level":"error","ts":1632213733.1759012,"logger":"controller","msg":"Reconciler error","controller":"ingress","name":"cluster-group-1","namespace":"","error":"ingress: group-1/expired-cert-ingress: none certificate found for host: expired-domain.net"}

This prevents us from creating new ingresses, or even deleting old ones, since the ALB will not be updated. We must either manually delete the ingress with the expired certificate or renew the certificate.
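To find which ingress is blocking a group, the error field of the log line above can be parsed. The following is a minimal sketch, assuming the error text always has the shape `ingress: <namespace>/<name>: <reason>` as in the quoted line; the function name `find_blocking_ingress` and the regular expression are illustrative and not part of the controller.

```python
import json
import re

# Assumed shape of the error text: "ingress: <namespace>/<name>: <reason>"
INGRESS_ERROR = re.compile(
    r"^ingress: (?P<namespace>[^/\s]+)/(?P<name>[^:\s]+): (?P<reason>.+)$"
)


def find_blocking_ingress(log_line):
    """Return details of the ingress blamed in a 'Reconciler error' line, or None."""
    try:
        entry = json.loads(log_line)
    except json.JSONDecodeError:
        return None
    if not isinstance(entry, dict) or entry.get("msg") != "Reconciler error":
        return None
    match = INGRESS_ERROR.match(entry.get("error", ""))
    if match is None:
        return None
    return {
        "group": entry.get("name"),  # the log's "name" field holds the group
        "namespace": match.group("namespace"),
        "ingress": match.group("name"),
        "reason": match.group("reason"),
    }


sample = (
    '{"level":"error","ts":1632213733.1759012,"logger":"controller",'
    '"msg":"Reconciler error","controller":"ingress","name":"cluster-group-1",'
    '"namespace":"","error":"ingress: group-1/expired-cert-ingress: '
    'none certificate found for host: expired-domain.net"}'
)
print(find_blocking_ingress(sample))
```

Errors with a different shape, such as an AWS API error, simply return `None` here, so this only covers the message format shown above.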

Describe the solution you'd like

The ALB Controller should ignore any ingress with a certificate failure and continue to reconcile add/update/delete actions for the other ingresses.
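The requested behavior could look roughly like the sketch below. It is only an illustration of "skip the failing ingress, keep reconciling the rest" and not how the controller is implemented; `reconcile_group`, `require_certificate`, `InvalidIngressError` and the dictionary fields are all hypothetical names.

```python
from dataclasses import dataclass, field


class InvalidIngressError(Exception):
    """Raised by the hypothetical validation step for a misconfigured ingress."""


@dataclass
class GroupResult:
    applied: list = field(default_factory=list)  # ingresses kept in the group
    skipped: dict = field(default_factory=dict)  # ingress name -> error message


def reconcile_group(ingresses, validate):
    """Keep valid ingresses and record failures instead of aborting the group."""
    result = GroupResult()
    for ingress in ingresses:
        try:
            validate(ingress)
        except InvalidIngressError as err:
            result.skipped[ingress["name"]] = str(err)
            continue
        result.applied.append(ingress["name"])
    return result


def require_certificate(ingress):
    # Hypothetical check standing in for certificate discovery.
    if ingress.get("tls") and not ingress.get("certificate"):
        raise InvalidIngressError(
            f"no certificate found for host: {ingress['host']}"
        )


group = [
    {"name": "healthy-a", "host": "a.example.com", "tls": True, "certificate": "cert-a"},
    {"name": "expired-cert-ingress", "host": "expired-domain.net", "tls": True, "certificate": None},
    {"name": "healthy-b", "host": "b.example.com", "tls": False},
]
result = reconcile_group(group, require_certificate)
print(result.applied)
print(result.skipped)
```

The skipped ingress is reported with its error rather than silently dropped, so the failure stays visible while the healthy ingresses are still processed.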

Describe alternatives you've considered

None

thanhma avatar Sep 21 '21 15:09 thanhma

@thanhma, do you use auto-discovery? If so, does the controller resume once you import a new certificate for the domain in question?

kishorj avatar Sep 21 '21 15:09 kishorj

@kishorj

Yes, I use certificate auto-discovery. But if the certificate renewal takes time, or the certificate cannot be renewed, it affects reconciliation of the other ingresses in the group.

thanhma avatar Sep 21 '21 15:09 thanhma

This also fails if the certificate ARN that is set on the ingress does not exist.

How to test:

  • Create an ingress and set the annotation to a nonexistent ACM cert
alb.ingress.kubernetes.io/certificate-arn: .....
  • Controller logs will look like:
{"level":"error","ts":1637353847.5984075,"logger":"controller","msg":"Reconciler error","controller":"ingress","name":"blox-presto","namespace":"prd2150","error":"CertificateNotFound: Certificate 'arn:aws:acm:us-east-1:-REDACTED-:certificate/4REDACTEDdc' not found\n\tstatus code: 400, request id: XXXXX"}

  • The aws-load-balancer-controller stops processing ANY updates to new or existing ingresses. We are running:
  • aws-load-balancer-controller:v2.2.4

jescarri avatar Nov 19 '21 20:11 jescarri

And I think this is not specific to ACM certificates: if any malformed ingress gets created in the cluster, the controller just halts.

This seems to be a regression; the alb-ingress-controller suffered from this and it got fixed there.

Let me know if I can help with testing etc., @M00nF1sh. tnx!

jescarri avatar Nov 19 '21 20:11 jescarri

@jescarri I don't think it's a regression, since the controller's behavior has always been to stop reconciliation if the Ingress configuration is invalid. I'm assuming you are using IngressGroup. The current behavior is that if a single Ingress within an IngressGroup contains invalid configuration, the entire IngressGroup stops reconciling, but other IngressGroups are not impacted (it would be a regression if other IngressGroups were impacted).
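The group-level blast radius described above can be illustrated with a small sketch. It assumes ingresses join a group through the `alb.ingress.kubernetes.io/group.name` annotation (not quoted in this thread) and treats an ingress without that annotation as its own unit; `blocked_ingresses` and the sample data are hypothetical.

```python
from collections import defaultdict

# Assumption: this annotation is what places an ingress into an IngressGroup.
GROUP_ANNOTATION = "alb.ingress.kubernetes.io/group.name"


def blocked_ingresses(ingresses, invalid):
    """Return valid ingresses ('namespace/name') that share a group with an invalid one."""
    groups = defaultdict(list)
    for ing in ingresses:
        ident = f"{ing['namespace']}/{ing['name']}"
        group = ing.get("annotations", {}).get(GROUP_ANNOTATION)
        # An ingress with no group annotation is reconciled on its own.
        key = ("group", group) if group else ("ungrouped", ident)
        groups[key].append(ident)
    blocked = []
    for members in groups.values():
        if any(member in invalid for member in members):
            blocked.extend(m for m in members if m not in invalid)
    return sorted(blocked)


ingresses = [
    {"namespace": "web", "name": "a", "annotations": {GROUP_ANNOTATION: "group-1"}},
    {"namespace": "web", "name": "bad", "annotations": {GROUP_ANNOTATION: "group-1"}},
    {"namespace": "web", "name": "c", "annotations": {GROUP_ANNOTATION: "group-2"}},
    {"namespace": "web", "name": "d", "annotations": {}},
]
print(blocked_ingresses(ingresses, invalid={"web/bad"}))
```

In this example only `web/a` is affected: it shares `group-1` with the invalid ingress, while `group-2` and the ungrouped ingress are untouched.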

We do have plans to optimize this in the future, see https://github.com/kubernetes-sigs/aws-load-balancer-controller/issues/2349

M00nF1sh avatar Nov 19 '21 21:11 M00nF1sh

hrrm we are not using ingress-groups yet, so I guess all ingresses in the cluster are part of the default ingressGroup?

jescarri avatar Nov 19 '21 21:11 jescarri

@jescarri No, by default ingresses don't belong to any IngressGroup. So in your case a single nonexistent cert prevents other ingresses from reconciling? Let me run a test to confirm it.

M00nF1sh avatar Nov 19 '21 21:11 M00nF1sh

@M00nF1sh yep, and not only certs: it's any invalid ALB setting, like subnets, certs, timeouts, etc.

jescarri avatar Nov 19 '21 21:11 jescarri

@jescarri I just tested it and cannot reproduce the issue. Are you on the Kubernetes Slack? If so, you can find me there as @M00nF1sh and we can debug live.

M00nF1sh avatar Nov 19 '21 22:11 M00nF1sh

hey @M00nF1sh sure, let me set up something, I will ping you there.

tnx!

jescarri avatar Nov 19 '21 23:11 jescarri

hey @M00nF1sh we are still experiencing this. It's a weird situation that takes time to develop.

We have seen it happen under these conditions:

  • An ingress is created without a certificate.
  • That same ingress is updated with an invalid certificate (updates fail for this ingress, which is OK).
  • Worker nodes are replaced (this can be gradual).
  • Some previously healthy ingresses/ALBs stop being updated with the new worker nodes (at some point the target groups become empty).

Hope this helps!

jescarri avatar Jan 03 '22 18:01 jescarri

hey @M00nF1sh I've made a new discovery: if the certificate exists but has failed renewal, the controller says the certificate does not exist / is not found, instead of just continuing its work.

jescarri avatar Jan 03 '22 18:01 jescarri

Hi,

We've run into a similar issue as well.

We currently have 2 ingresses. Ingress-1 has the following annotations:

   alb.ingress.kubernetes.io/backend-protocol-version: GRPC
   alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS":443}]'
   alb.ingress.kubernetes.io/target-type: ip

Ingress-2 has the following annotation:

   alb.ingress.kubernetes.io/target-type: ip

The certificate for Ingress-1 has not been uploaded to ACM.

Steps:

  1. Create Ingress-1. The Ingress object is created, but no load balancer is assigned. The aws-load-balancer-controller logs a bunch of errors, which are expected since the cert is not found:
   {"level":"error","ts":1648593364.4037778,"logger":"controller-runtime.manager.controller.ingress","msg":"Reconciler error","name":"corp","namespace":"","error":"ingress: prometheus-operator/thanos-query-grpc: none certificate found for host: thanos-grpc.dev-test8-labawsuse.lab.ppops.net"}
{"level":"error","ts":1648593364.5672228,"logger":"controller-runtime.manager.controller.ingress","msg":"Reconciler error","name":"corp","namespace":"","error":"ingress: prometheus-operator/thanos-query-grpc: none certificate found for host: thanos-grpc.dev-test8-labawsuse.lab.ppops.net"}
{"level":"error","ts":1648593364.8876278,"logger":"controller-runtime.manager.controller.ingress","msg":"Reconciler error","name":"corp","namespace":"","error":"ingress: prometheus-operator/thanos-query-grpc: none certificate found for host: thanos-grpc.dev-test8-labawsuse.lab.ppops.net"}
  2. Create Ingress-2 (same ingressclass/group). The Ingress object is created, but no load balancer is assigned, even though it has no cert dependency.

So basically any ingress (same ingressclass/group) created after a problematic ingress gets blocked. We can't even delete older working ingresses that were created before the problematic one unless we first delete the problematic ingress.

This is a blocker for us. Can this issue be looked into soon?

Ideally, the ALB Controller should ignore ingresses with issues, log appropriate errors for them, and continue to reconcile add/update/delete actions for the other ingresses.

vasu-git avatar Mar 30 '22 17:03 vasu-git

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jun 28 '22 17:06 k8s-triage-robot

/remove-lifecycle stale

M00nF1sh avatar Jun 30 '22 16:06 M00nF1sh

/lifecycle stale

k8s-triage-robot avatar Sep 28 '22 16:09 k8s-triage-robot

/lifecycle rotten

k8s-triage-robot avatar Oct 28 '22 17:10 k8s-triage-robot

/close not-planned

k8s-triage-robot avatar Nov 27 '22 17:11 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

k8s-ci-robot avatar Nov 27 '22 17:11 k8s-ci-robot

Is there any way to manually delete the old config to get rid of the error? I had to use a new ALB group just because one is stuck in an error state and there is no way to delete it.

mk2134226 avatar Feb 09 '23 04:02 mk2134226

We're seeing this too in our clusters running v2.4.6. What happened is that someone deleted the cert referenced by an ingress, and the reconcile loop for the entire group then stops at that error, leaving other healthy ingresses unreconciled.

Of course the work-around is to not delete the referenced cert until the ingress is deleted.
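One way to apply that workaround is to check, before deleting a cert, whether any ingress still references it. This is a minimal sketch, assuming the input is the parsed JSON from `kubectl get ingress --all-namespaces -o json` and that the `alb.ingress.kubernetes.io/certificate-arn` annotation may hold several comma-separated ARNs; the function name and the sample ARNs are made up for illustration.

```python
CERT_ANNOTATION = "alb.ingress.kubernetes.io/certificate-arn"


def ingresses_using_certificate(ingress_list, cert_arn):
    """Return 'namespace/name' for every ingress whose annotation lists cert_arn."""
    users = []
    for item in ingress_list.get("items", []):
        metadata = item.get("metadata", {})
        annotations = metadata.get("annotations") or {}
        # Assumption: several ARNs may be given, separated by commas.
        arns = [arn.strip() for arn in annotations.get(CERT_ANNOTATION, "").split(",")]
        if cert_arn in arns:
            users.append(f"{metadata.get('namespace', 'default')}/{metadata.get('name')}")
    return users


# Made-up sample data in the shape of a Kubernetes list response.
ARN_A = "arn:aws:acm:us-east-1:111122223333:certificate/example-a"
ARN_B = "arn:aws:acm:us-east-1:111122223333:certificate/example-b"
sample_list = {
    "items": [
        {"metadata": {"name": "api", "namespace": "prod",
                      "annotations": {CERT_ANNOTATION: f"{ARN_A}, {ARN_B}"}}},
        {"metadata": {"name": "web", "namespace": "prod",
                      "annotations": {CERT_ANNOTATION: ARN_B}}},
        {"metadata": {"name": "internal", "namespace": "dev"}},
    ]
}
print(ingresses_using_certificate(sample_list, ARN_A))
```

This only catches certs pinned by annotation. Ingresses that rely on certificate auto-discovery do not name an ARN, so they would need a separate check by host name.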

thelabdude avatar Mar 21 '23 19:03 thelabdude