aws-load-balancer-controller

Pod readiness gate condition does not exist when the pod is ready; it only resolves after a controller restart or ~16m


Describe the bug
I use the pod readiness gate with ALB IP mode. When a new pod becomes ready (after a deployment or scaling), the pod readiness gate waits too long (~15-16 minutes) and the pod status shows "corresponding condition of pod readiness gate target-health.elbv2.k8s.aws/k8s-test-testapp-28c9478fae does not exist." with reason ReadinessGatesNotReady.

  1. When I restart the alb-controller, all pod readiness gates become ready. I guess it retries the failed pods and adds them to the ALB.
  2. I noticed that the pod readiness gates become ready after targetgroupbindingMaxExponentialBackoffDelay (16m40s), so I decreased this value to 10s. After this change the problem was solved and all readiness gates were ready by the 11th second. But this is not a good solution: I don't know which other statuses this parameter affects, and it might break other things. By the way, why is it 16m40s? Is there a reason for this value?
  3. While the pod readiness gates are waiting, the Service endpoints keep the waiting pods in NotReadyAddresses, and there is an error in the Endpoints events: Failed to update endpoint test/test-app: Operation cannot be fulfilled on endpoints "test-app": the object has been modified; please apply your changes to the latest version and try again

[screenshot attached]

  4. I watched the ALB controller logs at info level during all of these steps and saw no errors or warnings.

This is a confusing situation. Could someone please explain it or suggest a solution?
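For context on what "does not exist" means here: the controller patches a target-health.elbv2.k8s.aws/... condition onto the pod once the target passes health checks, and kubelet keeps the pod in ReadinessGatesNotReady until that condition appears. The sketch below is a minimal, hypothetical client-go example for checking whether the condition has been set; the namespace, pod name, and kubeconfig handling are placeholders, not taken from this report.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the local kubeconfig (placeholder setup).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Placeholder namespace/pod; the condition type mirrors the one quoted in the issue.
	pod, err := client.CoreV1().Pods("test").Get(context.TODO(), "test-app-xyz", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	found := false
	for _, c := range pod.Status.Conditions {
		if c.Type == "target-health.elbv2.k8s.aws/k8s-test-testapp-28c9478fae" {
			found = true
			fmt.Printf("condition present: status=%s reason=%s\n", c.Status, c.Reason)
		}
	}
	if !found {
		// This is the state described above: the condition is never created,
		// so the pod stays ReadinessGatesNotReady until the controller retries.
		fmt.Println("readiness-gate condition does not exist on the pod")
	}
}
```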

Steps to reproduce

  1. Set the ALB target type to ip
  2. Activate slow_start.duration_seconds=60s
  3. Create a new pod (via a deployment, scaling, etc.) and watch its pod readiness gate

Expected outcome
The pod readiness gate should become ready shortly after the pod is ready.

Environment

  • AWS Load Balancer controller version: v2.7.1 (I also tested with 2.6.0 and 2.6.2, same problem)
  • Kubernetes version: v1.29.0
  • Using EKS (yes/no), if so version? : yes, v1.29.0

ibalat avatar Mar 22 '24 08:03 ibalat

Hi, appreciate you bringing this to our attention. We will be working to reproduce this issue and investigating further, as it is unusual for the backoff delay to reach that value or take that long. Thank you for your understanding and patience.

huangm777 avatar Mar 26 '24 21:03 huangm777

/kind bug

huangm777 avatar Mar 27 '24 20:03 huangm777

Hi @huangm777, do you have any update on this issue?

ibalat avatar Apr 15 '24 05:04 ibalat

Hi @kishorj, why is the targetgroupbinding-max-exponential-backoff-delay value 16m40s? If I set it to ~1m, will it affect or break anything else?

ibalat avatar Apr 26 '24 13:04 ibalat

We set the default value to 1000s (or 16m40s) to follow the upstream client-go default: https://github.com/kubernetes/client-go/blob/62f959700d559dd8a33c1f692cb34219cfef930f/util/workqueue/default_rate_limiters.go#L52.

The value caps the maximum delay before a failed item is retried. If you decrease it to 1-3m you will get faster retry attempts, but that may overwhelm the workqueue and lead to a potential load increase. It's better to test it in a dev environment and fine-tune the value.
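To make the backoff arithmetic concrete, here is a minimal sketch assuming the controller's workqueue behaves like the client-go per-item exponential rate limiter linked above (5ms base, 1000s cap); the item name is hypothetical. The delay doubles on every consecutive failure of the same item, which is how a retry can end up scheduled roughly 16 minutes out.

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	// Per-item exponential backoff: 5ms, 10ms, 20ms, ... capped at maxDelay.
	// Upstream uses 1000s as the cap; the controller flag
	// targetgroupbinding-max-exponential-backoff-delay appears to tune this cap.
	limiter := workqueue.NewItemExponentialFailureRateLimiter(5*time.Millisecond, 1000*time.Second)

	item := "tgb/k8s-test-testapp" // hypothetical workqueue item
	for i := 0; i < 20; i++ {
		fmt.Printf("failure %2d -> next retry in %v\n", i+1, limiter.When(item))
	}
}
```

With those defaults, the cap kicks in around the 19th consecutive failure of the same item (5ms · 2^18 ≈ 1311s, clamped to 1000s), so a persistently failing reconcile can sit in the queue for up to 16m40s before the next attempt.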

oliviassss avatar May 14 '24 19:05 oliviassss

If I have no failed items, then even with the delay set to one minute no extra work runs, right? Otherwise, if the workqueue does get overwhelmed, can it continue correctly if I increase the controller pod's resources?

ibalat avatar May 15 '24 08:05 ibalat

We are experiencing exactly the same behavior. I can provide outputs or help troubleshoot if needed. We have multiple deployments that experience this.

othatbrian avatar Jul 30 '24 14:07 othatbrian

@othatbrian What is the workaround you followed to mitigate the issue?

veludcx avatar Sep 18 '24 07:09 veludcx

@othatbrian What is the workaround you followed to mitigate the issue?

I added --targetgroupbinding-max-exponential-backoff-delay=60s to the command line arguments for our aws-load-balancer-controller.

othatbrian avatar Sep 18 '24 13:09 othatbrian

Is this resolved? We are facing the same issue.

jyoti-io avatar Sep 26 '24 12:09 jyoti-io

We have the very same situation, and we are looking forward to seeing whether this issue gets any outcome that could lead us to a final resolution.

ItaloChagas avatar Sep 30 '24 17:09 ItaloChagas

We have been using --targetgroupbinding-max-exponential-backoff-delay=60s for a long time and it is working; no other problems so far.

ibalat avatar Oct 01 '24 07:10 ibalat

Hi folks! We have released 2.9.1, which should address this exact issue without the need to tune the targetgroupbinding-max-exponential-backoff-delay flag: https://github.com/kubernetes-sigs/aws-load-balancer-controller/releases/tag/v2.9.1

zac-nixon avatar Oct 15 '24 17:10 zac-nixon