aws-load-balancer-controller
AWS load balancer controller takes a long time to start (to become fully operational)
Describe the bug
We have deployed the aws-load-balancer-controller with 2 replicas. However, during a leader change (due to rotation or the worker node reaching its TTL), the other replica takes over leadership, and the transition takes a considerable amount of time before the new leader is fully operational. It appears that the new leader rebuilds the model and reconciles all configurations (re-reading every TG/ALB from the AWS API, to be precise), even for TGs/ALBs that do not require it.
For the first 10-20 minutes after starting or assuming leadership, any changes made to the endpoints are delayed and not promptly reflected in the TG/ALB. The delay persists until the 'start/first_reconcile' process finishes.
This makes it pointless to deploy more than one instance of the controller.
From the metrics (attached screenshots) we deduce that this is probably caused by AWS API limits/throttling.
Expected outcome
The model needs to be rebuilt, but reconciliation could rely on a timestamp (endpoint last change vs. TGB last reconcile) stored within the TargetGroupBinding. That way, only the changes that have actually occurred since the last reconcile would be considered.
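As a purely illustrative sketch of that idea (the `elbv2.k8s.aws/last-reconcile` annotation below is hypothetical, not an existing controller feature), the new leader could skip any TargetGroupBinding whose endpoints have not changed since the recorded timestamp:

```yaml
apiVersion: elbv2.k8s.aws/v1beta1
kind: TargetGroupBinding
metadata:
  name: my-service-tgb   # example name
  annotations:
    # Hypothetical marker written after a successful reconcile. On leader change,
    # the controller would compare this against the endpoints' last modification
    # time and skip the AWS API calls if nothing has changed since.
    elbv2.k8s.aws/last-reconcile: "2023-07-01T12:00:00Z"
spec:
  serviceRef:
    name: my-service     # example service
    port: 80
  targetGroupARN: arn:aws:elasticloadbalancing:yy-xxxx-1:111111111111:targetgroup/my-tg/0123456789abcdef
```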
Environment
- AWS Load Balancer controller version 2.5.2
- Kubernetes version 1.24
- Using EKS: yes
Additional context:
- ALBs provisioned by the controller: ~100
- TargetGroupBindings: ~1500
ALB controller args:
- --cluster-name=zzzzzz
- --ingress-class=alb
- --aws-region=yy-xxxx-1
- --aws-vpc-id=vpc-xxxxxxxx
- --enable-shield=false
/area performance
@kwarunek thanks for sharing this. This is an interesting problem: the standby replica only has a cache of the k8s objects, but no cache of the AWS objects (whereas the leader replica holds such a cache in memory).
I think the timestamp is a good idea to skip already fully reconciled TGBs.
BTW, would you also share the controller logs? We would like to understand the operations performed by the controller.
@M00nF1sh I will prepare logs (info level) with redacted names
@kwarunek, hi, would you consider enabling the RGT API via the controller flag --feature-gates=EnableRGTAPI=true? It can avoid the ELB API throttling issue and will help reduce the reconcile time, especially when there are numerous resources. You can read more about the feature gate flag in our release notes and live docs. Please be mindful that the RGT API does not work on private clusters.
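For reference, enabling the gate would just extend the controller args quoted above (the cluster-specific values are the reporter's placeholders):

```yaml
# Controller container args with the RGT feature gate enabled
- --cluster-name=zzzzzz
- --ingress-class=alb
- --aws-region=yy-xxxx-1
- --aws-vpc-id=vpc-xxxxxxxx
- --enable-shield=false
- --feature-gates=EnableRGTAPI=true  # discover resources via the Resource Groups Tagging API
```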
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
You can:
- Mark this issue as fresh with `/remove-lifecycle stale`
- Close this issue with `/close`
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
@kwarunek We have also made some improvements around this area in v2.7.1. Could you please upgrade to this new version and see if this resolves your problem?
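If the controller is managed with the eks/aws-load-balancer-controller Helm chart, one way to pick up the new version is to pin the image tag in the chart values (a sketch under that assumption; adapt to however your deployment is managed):

```yaml
# values.yaml for the eks/aws-load-balancer-controller Helm chart
replicaCount: 2
image:
  repository: public.ecr.aws/eks/aws-load-balancer-controller
  tag: v2.7.1
```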
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
You can:
- Mark this issue as fresh with `/remove-lifecycle rotten`
- Close this issue with `/close`
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
You can:
- Reopen this issue with `/reopen`
- Mark this issue as fresh with `/remove-lifecycle rotten`
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
It's a bit better, but it still takes ~10 minutes.
/reopen
@kwarunek: Reopened this issue.
++ We are facing this issue right now. Our AWS API calls also get throttled.
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
/reopen
There are customers affected by this, especially at larger scale.
@xdrus: You can't reopen an issue/PR unless you authored it or you are a collaborator.
/reopen
@kwarunek: Reopened this issue.
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
The fix for this was released in v2.9.1.