aws-load-balancer-controller
Ingress group stuck with one ingress certificate error
Is your feature request related to a problem? I have been using an IngressGroup to group about 30 ingresses behind a single ALB. Each ingress has its own SSL certificate that was imported into ACM.
When a certificate expires and, for some reason, we are unable to renew it, the ALB Controller starts failing to reconcile the whole ingress group, with logs like this:
{"level":"error","ts":1632213733.1759012,"logger":"controller","msg":"Reconciler error","controller":"ingress","name":"cluster-group-1","namespace":"","error":"ingress: group-1/expired-cert-ingress: none certificate found for host: expired-domain.net"}
This prevents us from creating new ingresses, or even deleting old ones, since the ALB will not be updated. We must manually delete the ingress with the expired certificate, or renew the certificate.
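For context, each member of the group looks roughly like this. A minimal sketch: the backend service name and port are illustrative; the group, namespace, ingress, and host names are taken from the log line above.
# One of ~30 ingresses sharing a single ALB via the same group.name annotation.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: expired-cert-ingress
  namespace: group-1
  annotations:
    alb.ingress.kubernetes.io/group.name: cluster-group-1
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS":443}]'
spec:
  ingressClassName: alb
  rules:
    - host: expired-domain.net   # the certificate for this host lives in ACM
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: example-svc   # illustrative backend
                port:
                  number: 80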
Describe the solution you'd like: the ALB Controller should ignore ingresses with certificate failures and continue to reconcile the other add/update/delete ingress actions.
Describe alternatives you've considered: None
@thanhma, do you use auto-discovery? If so, does the controller resume once you import a new certificate for the domain in question?
@kishorj
Yes, I use certificate auto-discovery. But if the certificate renewal takes time, or the certificate cannot be renewed at all, it affects reconciliation of the other ingresses in the group.
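For anyone following along, a rough sketch of the two certificate modes, with illustrative values: with auto-discovery no ARN is set and the controller looks up an ACM certificate covering the ingress host; with the explicit annotation (commented out below) the given ARN is used directly.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: autodiscovery-example   # hypothetical name
  annotations:
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS":443}]'
    # Explicit alternative to auto-discovery (illustrative ARN):
    # alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:us-east-1:111111111111:certificate/example-id
spec:
  ingressClassName: alb
  rules:
    - host: expired-domain.net   # auto-discovery matches this host against ACM
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: example-svc   # illustrative backend
                port:
                  number: 80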
This also fails if the certificate ARN that is set on the ingress does not exist.
How to test:
- Create an ingress and set the annotation to a non-existent ACM certificate (a full manifest sketch follows this list):
alb.ingress.kubernetes.io/certificate-arn: .....
- Controller logs will look like:
{"level":"error","ts":1637353847.5984075,"logger":"controller","msg":"Reconciler error","controller":"ingress","name":"blox-presto","namespace":"prd2150","error":"CertificateNotFound: Certificate 'arn:aws:acm:us-east-1:-REDACTED-:certificate/4REDACTEDdc' not found\n\tstatus code: 400, request id: XXXXX"}
- The aws-load-balancer-controller stops processing ANY updates to new or existing ingresses. We are running:
- aws-load-balancer-controller:v2.2.4
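A full reproduction sketch, assuming an ALB ingress class; every value is illustrative, and the ARN deliberately points at a certificate that does not exist:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: bad-cert-repro   # hypothetical name
  annotations:
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS":443}]'
    # This ARN does not resolve to any certificate in ACM:
    alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:us-east-1:111111111111:certificate/does-not-exist
spec:
  ingressClassName: alb
  defaultBackend:
    service:
      name: example-svc   # illustrative backend
      port:
        number: 80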
And I think this is not specific to ACM certificates: if any malformed ingress gets created in the cluster, the controller simply halts.
This seems to be a regression; the old alb-ingress-controller suffered from this and it was fixed.
Let me know if I can help with testing etc., @M00nF1sh, thanks!
@jescarri I don't think it's a regression, since the controller's behavior has always been to stop reconciliation if an Ingress configuration is invalid. I'm assuming you are using IngressGroup. The current behavior is that if a single Ingress within an IngressGroup contains invalid configuration, the entire IngressGroup stops reconciling, but other IngressGroups are unimpacted (it would be a regression if other IngressGroups were impacted).
We do have plans to optimize this in the future; see https://github.com/kubernetes-sigs/aws-load-balancer-controller/issues/2349
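To illustrate the grouping semantics, a minimal sketch with hypothetical names: two ingresses join the same IngressGroup (and hence share one ALB) via the group.name annotation, so invalid configuration on either one stops reconciliation for both.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app-a
  annotations:
    alb.ingress.kubernetes.io/group.name: shared-alb   # same group ...
spec:
  ingressClassName: alb
  defaultBackend:
    service:
      name: app-a
      port:
        number: 80
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app-b
  annotations:
    alb.ingress.kubernetes.io/group.name: shared-alb   # ... as this one
spec:
  ingressClassName: alb
  defaultBackend:
    service:
      name: app-b
      port:
        number: 80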
hrrm we are not using ingress-groups yet, so I guess all ingresses in the cluster are part of the default ingressGroup?
@jescarri No, by default ingresses don't belong to any IngressGroup. So in your case a single non-existent cert prevents other ingresses from reconciling? Let me run a test to confirm.
@M00nF1sh yep, and not only certs; it's any invalid ALB setting, like subnets, certs, timeouts, etc.
@jescarri I just tested it and cannot reproduce the issue. Are you on the Kubernetes Slack? If so, you can find me as @M00nF1sh and we can live-debug there.
hey @M00nF1sh sure, let me set something up, I will ping you there.
thanks!
hey @M00nF1sh we are still experiencing this; it's a weird situation that takes time to develop.
We have seen it happen when these conditions occur:
- An ingress is created without a certificate.
- That same ingress is updated with an invalid certificate (updates will fail for this ingress, which is expected; a manifest sketch of this step follows below).
- Worker nodes are replaced (this can happen gradually).
- Some previously healthy ingresses/ALBs stop being updated with the new worker nodes (at some point the target groups become empty).
Hope this helps!
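A sketch of the ingress after the second step, with hypothetical names and an illustrative ARN; the annotation added by the update references a certificate that is invalid or missing:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: scenario-ingress   # created without any certificate settings in step 1
  annotations:
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS":443}]'
    # Added by the step-2 update; does not resolve to a valid certificate in ACM:
    alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:us-east-1:111111111111:certificate/invalid-or-missing
spec:
  ingressClassName: alb
  defaultBackend:
    service:
      name: example-app   # illustrative backend
      port:
        number: 80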
hey @M00nF1sh I've made a new discovery: if the certificate exists but its renewal has failed, the controller says the certificate does not exist/is not found, instead of just continuing its work.
Hi,
We've run into a similar issue as well.
We currently have two ingresses (both are sketched in full after the annotation lists below). Ingress-1 has the following annotations:
alb.ingress.kubernetes.io/backend-protocol-version: GRPC
alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS":443}]'
alb.ingress.kubernetes.io/target-type: ip
Ingress-2 has the following annotations:
alb.ingress.kubernetes.io/target-type: ip
Certificate for Ingress-1 is not uploaded to ACM.
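For completeness, a sketch of the two ingresses under the stated assumptions (backend services, ports, and the Ingress-2 name are illustrative; the Ingress-1 name, namespace, and host come from the logs below; both share the alb ingress class, i.e. the same group):
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: thanos-query-grpc   # "Ingress-1"; no cert for its host exists in ACM
  namespace: prometheus-operator
  annotations:
    alb.ingress.kubernetes.io/backend-protocol-version: GRPC
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS":443}]'
    alb.ingress.kubernetes.io/target-type: ip
spec:
  ingressClassName: alb
  rules:
    - host: thanos-grpc.dev-test8-labawsuse.lab.ppops.net
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: thanos-query   # illustrative backend
                port:
                  number: 10901
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ingress-2   # hypothetical name; has no certificate dependency
  annotations:
    alb.ingress.kubernetes.io/target-type: ip
spec:
  ingressClassName: alb
  defaultBackend:
    service:
      name: example-app   # illustrative backend
      port:
        number: 80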
Steps:
- Create Ingress-1. The Ingress object is created, but a load balancer is not assigned. We see a bunch of errors in the aws-load-balancer-controller, which are expected since the cert is not found:
{"level":"error","ts":1648593364.4037778,"logger":"controller-runtime.manager.controller.ingress","msg":"Reconciler error","name":"corp","namespace":"","error":"ingress: prometheus-operator/thanos-query-grpc: none certificate found for host: thanos-grpc.dev-test8-labawsuse.lab.ppops.net"}
{"level":"error","ts":1648593364.5672228,"logger":"controller-runtime.manager.controller.ingress","msg":"Reconciler error","name":"corp","namespace":"","error":"ingress: prometheus-operator/thanos-query-grpc: none certificate found for host: thanos-grpc.dev-test8-labawsuse.lab.ppops.net"}
{"level":"error","ts":1648593364.8876278,"logger":"controller-runtime.manager.controller.ingress","msg":"Reconciler error","name":"corp","namespace":"","error":"ingress: prometheus-operator/thanos-query-grpc: none certificate found for host: thanos-grpc.dev-test8-labawsuse.lab.ppops.net"}
- Create Ingress-2 (same ingress class/group). The Ingress object is created, but a load balancer is not assigned, even though it has no cert dependency.
So basically any ingresses (same ingress class/group) created after a problematic ingress get blocked. We can't even delete older, working ingresses created before the problematic one unless we delete the problematic ingress.
This is a blocker for us. Can this issue be looked into soon?
Ideally the ALB Controller should ignore ingresses with issues (logging appropriate errors) and continue to reconcile the other add/update/delete ingress actions.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Is there any way to manually delete the old config to get rid of the error? I had to use a new ALB group just because one ingress is stuck in an error state and there is no way to delete it.
We're seeing this too in our clusters running v2.4.6. What happened is that someone deleted the cert referenced by an ingress, and the reconcile loop for the entire group then stops at that error, leaving other healthy ingresses un-reconciled.
Of course the work-around is to not delete the referenced cert until the ingress is deleted.