aws-load-balancer-controller icon indicating copy to clipboard operation
aws-load-balancer-controller copied to clipboard

AWS load balancer controller takes a long time to start (to fully operate)

Open kwarunek opened this issue 2 years ago • 12 comments

Describe the bug We have deployed the aws-load-balancer-controller with 2 replicas. However, during the leader change (due to rotation or the TTL of the worker node), the other replica takes over leadership. This transition takes a considerable amount of time to fully complete operations. It appears that the new replica rebuilds the model and attempts to reconcile (reread all TG/ALB from AWS API to be precise) all configurations, even for the tg/alb that do not require it.

In the initial 10-20 minutes after starting or assuming leadership, any changes made to the endpoints experience delays and are not promptly reflected in the TG/ALB. This delay persists until the 'start/first_reconcile' process is finished.

This makes no sense to deploy more than 1 instance of controller.

From the metrics (attached screenshots) we deduce that it's probably due to API limits/throttling.

Expected outcome The model needs to be rebuilt, but reconciliation could potentially rely on a timestamp (endpoint last change vs tgb last reconcile) within the TargetGroupBinding. This approach would involve considering only the changes that have occurred.

Environment

  • AWS Load Balancer controller version 2.5.2
  • Kubernetes version 1.24
  • Using EKS: yes

Additional Context: Number of

  • provisioned ALB by controller: ~100
  • TargetGRoupBindings: ~ 1500

ALB controller args:

  • --cluster-name=zzzzzz
  • --ingress-class=alb
  • --aws-region=yy-xxxx-1
  • --aws-vpc-id=vpc-xxxxxxxx
  • --enable-shield=false

Screenshot 2023-08-10 at 00-41-47 AWS Load Balancer Controller - EKS - k8s-apps-v1 - Dashboards - Grafana Screenshot 2023-08-10 at 00-40-31 AWS Load Balancer Controller - EKS - k8s-apps-v1 - Dashboards - Grafana

kwarunek avatar Aug 09 '23 23:08 kwarunek

/area performance @kwarunek thanks for sharing this, this is a interesting problem as the standby replica only have cache for k8s Objects, but not cache for AWS objects(where the master replica have such cache in memory).

I think the timestamp is a good idea to skip already fully reconciled TGBs.

BTW, Would you help share the controller logs as well, would like to understand the operations done in the controller.

M00nF1sh avatar Aug 11 '23 18:08 M00nF1sh

@M00nF1sh I will prepare logs (info level) with redacted names

kwarunek avatar Aug 14 '23 17:08 kwarunek

@kwarunek, hi, would you consider enabling RGT API by via controller flag --feature-gates=EnableRGTAPI=true? It can avoid ELB API throttling issue and will help to reduce the reconcile time, especially for the case where there are numerous resources. You can check more about the feature gate flag in our release note and live doc. Please be mindful that RGT API does not work on private clusters.

oliviassss avatar Sep 05 '23 18:09 oliviassss

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jan 27 '24 17:01 k8s-triage-robot

@kwarunek We have also made some improvements around this area in v2.7.1. Could you please upgrade to this new version and see if this resolves your problem?

shraddhabang avatar Feb 14 '24 18:02 shraddhabang

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Mar 15 '24 19:03 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot avatar Apr 14 '24 19:04 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Apr 14 '24 19:04 k8s-ci-robot

It's a bit better but still it takes ~10minutes

Screenshot 2024-05-13 at 14-26-16 AWS Load Balancer Controller - AWS - Dashboards - Grafana Screenshot 2024-05-13 at 14-26-49 AWS Load Balancer Controller - AWS - Dashboards - Grafana

kwarunek avatar May 13 '24 12:05 kwarunek

/reopen

kwarunek avatar May 13 '24 12:05 kwarunek

@kwarunek: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar May 13 '24 12:05 k8s-ci-robot

++ we are facing this issue right now. Our aws api also gets throttled

yuvraj9 avatar Jun 07 '24 06:06 yuvraj9

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot avatar Jul 07 '24 07:07 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Jul 07 '24 07:07 k8s-ci-robot

/reopen

There are customers affected by this, especially at bigger scale.

xdrus avatar Jul 08 '24 01:07 xdrus

@xdrus: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen

There are customers affected by this, especially at bigger scale.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Jul 08 '24 01:07 k8s-ci-robot

/reopen

kwarunek avatar Jul 31 '24 13:07 kwarunek

@kwarunek: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Jul 31 '24 13:07 k8s-ci-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot avatar Aug 30 '24 14:08 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Aug 30 '24 14:08 k8s-ci-robot

The fix for this was released under 2.9.1.

zac-nixon avatar Oct 15 '24 20:10 zac-nixon