aws-load-balancer-controller
AWS load balancer controller takes a long time to start (to become fully operational)
Describe the bug
We have deployed the aws-load-balancer-controller with 2 replicas. However, during a leader change (due to rotation or the worker node reaching its TTL), the other replica takes over leadership, and the transition takes a considerable amount of time before the new leader is fully operational. It appears that the new leader rebuilds the model and reconciles all configurations (re-reading every TG/ALB from the AWS API, to be precise), even for TGs/ALBs that do not require it.
For the first 10-20 minutes after starting or assuming leadership, any changes made to the endpoints are delayed and not promptly reflected in the TG/ALB. The delay persists until the 'start/first_reconcile' process finishes.
This makes it pointless to deploy more than one instance of the controller.
From the metrics (attached screenshots) we deduce that this is probably caused by AWS API limits/throttling.
Expected outcome
The model needs to be rebuilt, but reconciliation could rely on a timestamp (endpoint last change vs. TGB last reconcile) stored within the TargetGroupBinding. That way, only the changes that have actually occurred since the last reconcile would be considered.
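As a purely illustrative sketch of that idea (the `elbv2.k8s.aws/last-reconcile` annotation below is hypothetical, not an existing controller feature), the new leader could skip any TargetGroupBinding whose endpoints have not changed since the recorded timestamp:

```yaml
apiVersion: elbv2.k8s.aws/v1beta1
kind: TargetGroupBinding
metadata:
  name: my-service-tgb   # example name
  annotations:
    # Hypothetical marker written after a successful reconcile. On leader change,
    # the controller would compare this against the endpoints' last modification
    # time and skip the AWS API calls if nothing has changed since.
    elbv2.k8s.aws/last-reconcile: "2023-07-01T12:00:00Z"
spec:
  serviceRef:
    name: my-service     # example service
    port: 80
  targetGroupARN: arn:aws:elasticloadbalancing:yy-xxxx-1:111111111111:targetgroup/my-tg/0123456789abcdef
```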
Environment
- AWS Load Balancer controller version 2.5.2
- Kubernetes version 1.24
- Using EKS: yes
Additional context:
- ALBs provisioned by the controller: ~100
- TargetGroupBindings: ~1500
ALB controller args:
- --cluster-name=zzzzzz
- --ingress-class=alb
- --aws-region=yy-xxxx-1
- --aws-vpc-id=vpc-xxxxxxxx
- --enable-shield=false
/area performance
@kwarunek thanks for sharing this. This is an interesting problem: the standby replica only has a cache of the k8s objects, but no cache of the AWS objects (whereas the leader replica holds such a cache in memory).
I think the timestamp is a good idea to skip already fully reconciled TGBs.
BTW, would you also share the controller logs? We would like to understand the operations performed by the controller.
@M00nF1sh I will prepare logs (info level) with redacted names
@kwarunek, hi, would you consider enabling the RGT API via the controller flag --feature-gates=EnableRGTAPI=true? It can avoid the ELB API throttling issue and will help reduce the reconcile time, especially when there are numerous resources. You can read more about the feature gate flag in our release notes and live docs. Please be mindful that the RGT API does not work on private clusters.
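For reference, enabling the gate would just extend the controller args quoted above (the cluster-specific values are the reporter's placeholders):

```yaml
# Controller container args with the RGT feature gate enabled
- --cluster-name=zzzzzz
- --ingress-class=alb
- --aws-region=yy-xxxx-1
- --aws-vpc-id=vpc-xxxxxxxx
- --enable-shield=false
- --feature-gates=EnableRGTAPI=true  # discover resources via the Resource Groups Tagging API
```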
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
You can:
- Mark this issue as fresh with `/remove-lifecycle stale`
- Close this issue with `/close`
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
@kwarunek We have also made some improvements around this area in v2.7.1. Could you please upgrade to this new version and see if this resolves your problem?
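If the controller is managed with the eks/aws-load-balancer-controller Helm chart, one way to pick up the new version is to pin the image tag in the chart values (a sketch under that assumption; adapt to however your deployment is managed):

```yaml
# values.yaml for the eks/aws-load-balancer-controller Helm chart
replicaCount: 2
image:
  repository: public.ecr.aws/eks/aws-load-balancer-controller
  tag: v2.7.1
```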
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
You can:
- Mark this issue as fresh with `/remove-lifecycle rotten`
- Close this issue with `/close`
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
You can:
- Reopen this issue with `/reopen`
- Mark this issue as fresh with `/remove-lifecycle rotten`
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
It's a bit better, but it still takes ~10 minutes.
/reopen
@kwarunek: Reopened this issue.
++ We are facing this issue right now. Our AWS API calls also get throttled.
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
/reopen
There are customers affected by this, especially at larger scale.
@xdrus: You can't reopen an issue/PR unless you authored it or you are a collaborator.
/reopen
@kwarunek: Reopened this issue.
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
The fix for this was released in v2.9.1.