aws-load-balancer-controller
Getting 502/504 with Pod Readiness Gates during rolling updates
I'm making use of the Pod Readiness Gate on Kubernetes Deployments running Golang-based APIs. The goal is to achieve fully zero-downtime deployments.
During a rolling update of the Kubernetes Deployment, I'm getting 502/504 responses from these APIs. This did not happen when setting target-type: instance.
I believe the problem is that AWS does not drain the pod from the load balancer before Kubernetes terminates it.
Timeline of events:
- Perform a rolling update on the deployment (1 replica)
- A second pod is created in the deployment
- AWS registers a second target in the Load Balancing Target Group
- Both pods begin receiving traffic
- I'm not sure which happens first at this point: (a) AWS begins de-registering/draining the target, or (b) Kubernetes begins terminating the pod
- Traffic sent to the deployment begins receiving 502 and 504 errors
- The old pod is deleted
- Traffic returns to normal (200)
- The target is de-registered/drained (depending on the deregistration delay)
This is tested with a looping curl command:
while true; do
  curl --write-out '%{url_effective} - %{http_code} -' --silent --output /dev/null -L https://example.com | pv -N "$(date +"%T")" -t
  sleep 1
done
Results:
https://example.com - 200 - 13:04:16: 0:00:00
https://example.com - 502 - 13:04:17: 0:00:01
https://example.com - 200 - 13:04:20: 0:00:00
https://example.com - 504 - 13:04:31: 0:00:10
https://example.com - 200 - 13:04:32: 0:00:00
https://example.com - 200 - 13:04:33: 0:00:00
https://example.com - 200 - 13:04:34: 0:00:00
https://example.com - 200 - 13:04:35: 0:00:00
https://example.com - 200 - 13:04:36: 0:00:00
We've been having the same issue. We confirmed with AWS that there is some propagation time between when a target is marked draining in a target group and when that target actually stops receiving new connections. So, following the suggestion in other issues I've seen in the old project for this, we added a 20s sleep in a preStop script. This hasn't entirely eliminated them, though; they still happen on deployment, just not with as much volume. Following this to see if anyone else has any good ideas, as troubleshooting these 502s has been infuriatingly difficult.
@calvinbui The pods need to have a preStop hook that sleeps, since most web frameworks (e.g. nginx/apache) stop accepting new connections once a soft stop (SIGTERM) is requested, and it takes some time for the controller to deregister the pod (after it gets the endpoint change event) and for the ELB to propagate the target changes to its data plane.
@AirbornePorcine do you still see 502s with the 20s sleep? Have you enabled the pod readinessGate? If you are using instance mode, you need an extra 30 seconds of sleep (since kube-proxy updates the iptables rules every 30 seconds).
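For reference, this is roughly what that workaround looks like on a Deployment. This is a minimal sketch, not an official recommendation; the name, image, and timings are illustrative and assume IP target type:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-api                      # placeholder name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-api
  template:
    metadata:
      labels:
        app: my-api
    spec:
      # Must exceed the preStop sleep plus the app's own shutdown time.
      terminationGracePeriodSeconds: 60
      containers:
        - name: api
          image: example/my-api:latest   # placeholder image
          lifecycle:
            preStop:
              exec:
                # Keep serving while the controller deregisters the target and the
                # change propagates to the ELB data plane. For instance target type,
                # add ~30s for kube-proxy's iptables sync interval.
                command: ["sh", "-c", "sleep 30"]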
@M00nF1sh that's correct: even with a 20s sleep and the auto-injected readinessGate, doing a rolling restart of my pods results in a small number of 502s. For reference, this is something like 5-6 502s out of 1m total requests in the same time period, so a very small amount, but still not something we want. I'm using IP mode here.
@AirbornePorcine in my own test, the sum of the controller process time (from pod kill to target deregistered) and the ELB API propagation time (from the deregister API call to the targets actually being removed from the ELB data plane) takes less than 10 seconds.
And the preStop hook sleep only needs to be controller process time + ELB API propagation time + HTTP req/resp RTT.
I just asked the ELB team whether they have p90/p99 metrics available for the ELB API propagation time. If so, we can recommend a safe preStop sleep.
Ok, so, we just did some additional testing on that sleep timing.
The only way we've been able to get zero 502s during a rolling deploy is to set our preStop sleep to the target group's deregistration delay + at least 5s. It seems almost like there's no way to guarantee that AWS isn't actually sending you new requests until the target is fully removed from the target group, and not just marked "draining".
Looking back in my emails, I realized this is exactly what AWS support had previously told us to do - don't stop the target from processing requests until the target group deregistration delay has elapsed at minimum (we added the 5s to account for the controller process and propagation time as you mentioned).
Next week we'll try tweaking our deregistration delay and see if the same holds true (it's currently 60s, but we really don't want to sleep that long if we can avoid it).
Something you might want to try though, @calvinbui!
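For anyone who wants to experiment with that deregistration delay, it can be set per Ingress via target group attributes. A sketch, assuming the v2 controller's annotation (names and values are illustrative, not a recommendation):

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-api                      # placeholder name
  annotations:
    alb.ingress.kubernetes.io/target-type: ip
    # Lower the target group deregistration delay so the preStop sleep
    # (deregistration delay + ~5s, per the comment above) can stay short.
    alb.ingress.kubernetes.io/target-group-attributes: deregistration_delay.timeout_seconds=30
spec:
  ingressClassName: alb
  rules:
    - host: example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-api        # placeholder Service
                port:
                  number: 80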
Thanks for the comments.
Adding a preStop hook and sleep, I was able to get all 200s during a rolling update of the deployment. I set the deregistration delay to 20 seconds and the sleep to 30 seconds.
However, during a node upgrade/rolling update I got 503s for around one minute. Are there any recommendations from AWS about that? I'm guessing I would need to bump up the deregistration delay, and probably the sleep time, a lot higher to allow the new node to fire up and the new pods to start as well.
After increasing the sleep to 90s and the terminationGracePeriod to 120s, there are no downtimes during a cluster upgrade/node upgrade on EKS.
However, if a deployment only has 1 replica, there is still ~1 min of downtime. For deployments with >=2 replicas, this was not a problem and no downtime was observed.
The documentation should be updated, so I'll leave this issue open.
EDIT: For the 1-replica issue, it was because k8s doesn't do a rolling deployment during a cluster/node upgrade. It is considered an involuntary disruption, so I had to scale up to 2 replicas and add a PDB.
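For completeness, a minimal sketch of such a PDB (name and selector are placeholders):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-api                      # placeholder name
spec:
  # With 2 replicas, node drains may evict at most one pod at a time.
  minAvailable: 1
  selector:
    matchLabels:
      app: my-api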
How about (ab)using a ValidatingAdmissionWebhook to delay pod deletion? Here's a sketch of the idea:
- The ValidatingAdmissionWebhook intercepts pod deletion. It won't allow deletion of the pod if the pod is still reachable from the ALB (IP target-type ingress, initially).
- However, it patches the pod: it removes the labels and ownerReferences so the pod is removed from the ReplicaSet and the Endpoints object. The ELB also starts draining, since the pod is removed from the Endpoints.
- After some time passes and the ELB finishes its draining, the pod is deleted by aws-load-balancer-controller.
edit: I've implemented this idea as a chart here: https://github.com/foriequal0/pod-graceful-drain
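For anyone curious what the interception part of that idea looks like, here is a rough sketch of registering a validating webhook on pod DELETE. This is hand-written for illustration, not the actual pod-graceful-drain manifest; the names, namespace, and path are placeholders:

apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: pod-deletion-delay          # placeholder name
webhooks:
  - name: pod-deletion-delay.example.com   # placeholder webhook name
    admissionReviewVersions: ["v1"]
    sideEffects: NoneOnDryRun
    failurePolicy: Ignore           # don't block deletions if the webhook is down
    rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["DELETE"]
        resources: ["pods"]
    clientConfig:
      service:
        namespace: kube-system      # placeholder namespace
        name: pod-deletion-delay    # placeholder Service
        path: /validate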
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
This is still a serious issue; any update on it? We currently use the solution from @foriequal0, which has been doing a great job so far. I wish this were officially handled by the controller project itself.
/remove-lifecycle stale
What's the protocol for getting this prioritized? We've hit it as well. This is a serious issue and while I understand there's a workaround (hack), it's certainly reducing my confidence in running production workloads on this thing.
I'm also seeing this issue, but I think it's not necessarily an issue with the LB Controller? It seems draining for NLBs doesn't work as I would have expected. Instead of stopping new connections and letting existing connections continue, it continues to send new connections to the draining targets for a while.
From my testing, the actual delay for a target to be fully de-registered and drained seems to be around 2-3 minutes.
Adding this to each container exposed behind an NLB has worked for me so far:
lifecycle:
  preStop:
    exec:
      command: [ sh, -c, "sleep 180" ]
I would love to be able to get rid of this but it simply seems that the NLBs are extremely slow in performing management operations. I have even seen target registrations take almost 10 minutes.
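If your controller version supports setting target group attributes on the Service, the NLB drain window can also be tuned there rather than only sleeping; a sketch (annotation values are illustrative, double-check them against the controller docs for your version):

apiVersion: v1
kind: Service
metadata:
  name: my-api                      # placeholder name
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: external
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
    # Shorten the drain window and close lingering connections once it expires.
    service.beta.kubernetes.io/aws-load-balancer-target-group-attributes: deregistration_delay.timeout_seconds=120,deregistration_delay.connection_termination.enabled=true
spec:
  type: LoadBalancer
  selector:
    app: my-api
  ports:
    - port: 443
      targetPort: 8443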
I completely agree with what @ardove has said.
The point of this readinessGate feature is to delay the termination of the pod for as long as the LB needs it. If I have to update my chart to put a sleep in the preStop hook, then it means that this feature is not working. If I have to use the preStop hook, then I might as well not even use this readinessGate feature.
In my observation, the pod is allowed to terminate as soon as the new target becomes ready/healthy. I have seen that the old target was still draining after the pod terminated, and obviously that's going to result in 502 errors for those requests.
This feature almost works. Without the feature enabled I see 30 seconds to 1 minute of solid 502 errors. With the feature enabled I get brief sluggishness and maybe one or a handful of 502s. Hopefully you can get this fixed, because unfortunately close to good isn't good enough for something like this.
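As a sanity check for anyone in the same situation: with the v2 controller the gate is injected by a webhook once the namespace is labelled, so it is worth confirming both the label and the resulting pod condition are present. A sketch (namespace name is a placeholder):

apiVersion: v1
kind: Namespace
metadata:
  name: my-app                      # placeholder namespace
  labels:
    # Tells the controller's webhook to inject the readiness gate into new pods.
    elbv2.k8s.aws/pod-readiness-gate-inject: enabled

New pods in that namespace should then show a readinessGates entry whose conditionType starts with target-health.elbv2.k8s.aws/, and the pod only becomes Ready once the corresponding target is healthy in the target group.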
I thought it might be useful to share this KubeCon talk, "The Gotchas of Zero-Downtime Traffic /w Kubernetes", where the speaker goes into the strategies required for zero-downtime rolling updates with Kubernetes deployments (at least as of 2022):
https://www.youtube.com/watch?v=0o5C12kzEDI
It can be a bit hard to conceptualise the limitations of the async nature of Ingress/Endpoint objects and Pod termination, so I found the above talk (and live demo) helped a lot.
Hopefully it's useful for others.
@M00nF1sh I am implementing the same in my Kubernetes cluster but am unable to calculate the sleep time for the preStop hook and terminationGracePeriodSeconds. Currently terminationGracePeriodSeconds is 120 seconds and the deregistration delay is 300 seconds. Do we have any mechanism to calculate this?
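Based on the workaround described earlier in this thread (sleep at least the deregistration delay plus a few seconds, and make the grace period longer than the sleep plus the app's own shutdown time), a rough sketch with those numbers plugged in (purely illustrative, and only a fragment of a Deployment spec):

# preStop sleep                  >= deregistration delay (300s) + ~5s margin          -> ~305s
# terminationGracePeriodSeconds  >= preStop sleep + app shutdown time (say ~25s)      -> ~330s
# (Or lower the deregistration delay first so these numbers stay reasonable.)
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 330
      containers:
        - name: api                 # placeholder
          lifecycle:
            preStop:
              exec:
                command: ["sh", "-c", "sleep 305"]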
Does anyone have an update on this? After almost two years, I cannot see that it has been solved natively yet.
I wonder if finalizers would solve this problem nicely here :thinking:
For clusters using Traefik proxy as the ingress, it might also be worth looking into the entrypoint lifecycle feature to control graceful shutdowns: https://doc.traefik.io/traefik/routing/entrypoints/#lifecycle. At least in this case it avoids the need for the sleep workaround :-)
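For reference, the option in question is the entrypoint transport lifeCycle block in Traefik's static configuration; a sketch (entrypoint name and timings are illustrative):

entryPoints:
  websecure:
    address: ":443"
    transport:
      lifeCycle:
        # Keep accepting new requests for a while after SIGTERM, so the load
        # balancer can finish deregistering the target before Traefik stops serving.
        requestAcceptGraceTimeout: 30s
        # Then give in-flight requests this long to finish.
        graceTimeOut: 10s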
https://www.reddit.com/r/ProgrammerHumor/comments/1092kmf/just_add_sleep/j3vqiv2?utm_medium=android_app&utm_source=share&context=3
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
Would the EndpointSlice terminating condition solve this issue? It says "Consumers of the EndpointSlice API, such as Kube-proxy and Ingress Controllers, can now use these conditions to coordinate connection draining events, by continuing to forward traffic for existing connections but rerouting new connections to other non-terminating endpoints." But I'm not sure whether it would also work in this case.
https://kubernetes.io/blog/2022/12/30/advancements-in-kubernetes-traffic-engineering/
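For context, this is what the linked post refers to: a terminating endpoint is published on its EndpointSlice with conditions like the following, which a consumer (such as this controller) could in principle use to start deregistration before the pod is actually gone. Hand-written sketch, not real output; names and addresses are placeholders:

apiVersion: discovery.k8s.io/v1
kind: EndpointSlice
metadata:
  name: my-api-abc12                # placeholder; normally generated
  labels:
    kubernetes.io/service-name: my-api
addressType: IPv4
ports:
  - name: http
    protocol: TCP
    port: 8080
endpoints:
  - addresses: ["10.0.1.23"]
    conditions:
      ready: false                  # no longer ready for new traffic
      serving: true                 # still able to serve existing connections
      terminating: true             # the pod has a deletionTimestamp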
/remove-lifecycle rotten
Bumping this issue. Adding a sleep() does not sound professional; it's a workaround and only a workaround :/
I am experiencing this issue, too.
Any update? Does the pod readiness gate work with v2.6?