aws-load-balancer-controller
Getting 502/504 with Pod Readiness Gates during rolling updates
I'm making use of the Pod Readiness Gate on Kubernetes Deployments running Golang-based APIs. The goal is to achieve fully zero-downtime deployments.
During a rolling update of the Kubernetes Deployment, I'm getting 502/504 responses from these APIs. This did not happen when setting target-type: instance.
I believe the problem is that AWS does not drain the pod from the load balancer before Kubernetes terminates it.
Timeline of events:
- Perform a rolling update on the deployment (1 replica)
- A second pod is created in the deployment
- AWS registers a second target in the Load Balancing Target Group
- Both pods begin receiving traffic
- I'm not sure which happens first at this point: (a) AWS begins de-registering/draining the target, or (b) Kubernetes begins terminating the pod
- Traffic sent to the deployment begins receiving 502 and 504 errors
- The old pod is deleted
- Traffic returns to normal (200)
- The target is de-registered/drained (depending on the deregistration delay)
This is tested with a looping curl command:
while true; do
  curl --write-out '%{url_effective} - %{http_code} -' --silent --output /dev/null -L https://example.com | pv -N "$(date +"%T")" -t
  sleep 1
done
Results:
https://example.com - 200 - 13:04:16: 0:00:00
https://example.com - 502 - 13:04:17: 0:00:01
https://example.com - 200 - 13:04:20: 0:00:00
https://example.com - 504 - 13:04:31: 0:00:10
https://example.com - 200 - 13:04:32: 0:00:00
https://example.com - 200 - 13:04:33: 0:00:00
https://example.com - 200 - 13:04:34: 0:00:00
https://example.com - 200 - 13:04:35: 0:00:00
https://example.com - 200 - 13:04:36: 0:00:00
We've been having the same issue. We confirmed with AWS that there is some propagation time between when a target is marked draining in a target group and when that target actually stops receiving new connections. So, following the suggestion in other issues I've seen in the old project for this, we added a 20s sleep in a preStop script. This hasn't entirely eliminated them, though; they still happen on deployment, just not with as much volume. Following this to see if anyone else has any good ideas, as troubleshooting these 502s has been infuriatingly difficult.
@calvinbui The pods need to have a preStop hook that sleeps, since most web frameworks (e.g. nginx/apache) stop accepting new connections once a soft stop (SIGTERM) is requested, and it takes some time for the controller to deregister the pod (after it gets the endpoint change event) and for the ELB to propagate the target changes to its data plane.
@AirbornePorcine do you still see 502s with the 20s sleep? Have you enabled the pod readinessGate? If you are using instance mode, you need an extra 30 seconds of sleep (since kube-proxy updates the iptables rules every 30 seconds).
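For reference, this is roughly what that workaround looks like on a Deployment. This is a minimal sketch, not an official recommendation; the name, image, and timings are illustrative and assume IP target type:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-api                      # placeholder name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-api
  template:
    metadata:
      labels:
        app: my-api
    spec:
      # Must exceed the preStop sleep plus the app's own shutdown time.
      terminationGracePeriodSeconds: 60
      containers:
        - name: api
          image: example/my-api:latest   # placeholder image
          lifecycle:
            preStop:
              exec:
                # Keep serving while the controller deregisters the target and the
                # change propagates to the ELB data plane. For instance target type,
                # add ~30s for kube-proxy's iptables sync interval.
                command: ["sh", "-c", "sleep 30"]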
@M00nF1sh that's correct: even with a 20s sleep and the auto-injected readinessGate, doing a rolling restart of my pods results in a small number of 502s. For reference, this is something like 5-6 502s out of 1m total requests in the same time period, so a very small amount, but still not something we want. I'm using IP mode here.
@AirbornePorcine in my own test, the sum of the controller process time (from pod kill to target deregistered) and the ELB API propagation time (from the deregister API call to the targets actually being removed from the ELB data plane) takes less than 10 seconds.
And the preStop hook sleep only needs to be controller process time + ELB API propagation time + HTTP req/resp RTT.
I just asked the ELB team whether they have p90/p99 metrics available for the ELB API propagation time. If so, we can recommend a safe preStop sleep.
Ok, so, we just did some additional testing on that sleep timing.
The only way we've been able to get zero 502s during a rolling deploy is to set our preStop sleep to the target group's deregistration delay + at least 5s. It seems almost like there's no way to guarantee that AWS isn't actually sending you new requests until the target is fully removed from the target group, and not just marked "draining".
Looking back in my emails, I realized this is exactly what AWS support had previously told us to do - don't stop the target from processing requests until the target group deregistration delay has elapsed at minimum (we added the 5s to account for the controller process and propagation time as you mentioned).
Next week we'll try tweaking our deregistration delay and see if the same holds true (it's currently 60s, but we really don't want to sleep that long if we can avoid it).
Something you might want to try though, @calvinbui!
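For anyone who wants to experiment with that deregistration delay, it can be set per Ingress via target group attributes. A sketch, assuming the v2 controller's annotation (names and values are illustrative, not a recommendation):

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-api                      # placeholder name
  annotations:
    alb.ingress.kubernetes.io/target-type: ip
    # Lower the target group deregistration delay so the preStop sleep
    # (deregistration delay + ~5s, per the comment above) can stay short.
    alb.ingress.kubernetes.io/target-group-attributes: deregistration_delay.timeout_seconds=30
spec:
  ingressClassName: alb
  rules:
    - host: example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-api        # placeholder Service
                port:
                  number: 80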
Thanks for the comments.
Adding a preStop hook and sleep, I was able to get all 200s during a rolling update of the deployment. I set the deregistration delay to 20 seconds and the sleep to 30 seconds.
However, during a node upgrade/rolling update I got 503s for around one minute. Are there any recommendations from AWS about that? I'm guessing I would need to bump up the deregistration delay, and probably the sleep time, a lot higher to allow the new node to fire up and the new pods to start as well.
After increasing the sleep to 90s and the terminationGracePeriod to 120s, there are no downtimes during a cluster upgrade/node upgrade on EKS.
However, if a deployment only has 1 replica, there is still ~1 min of downtime. For deployments with >=2 replicas, this was not a problem and no downtime was observed.
The documentation should be updated, so I'll leave this issue open.
EDIT: For the 1-replica issue, it was because k8s doesn't do a rolling deployment during a cluster/node upgrade. It is considered an involuntary disruption, so I had to scale up to 2 replicas and add a PDB.
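For completeness, a minimal sketch of such a PDB (name and selector are placeholders):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-api                      # placeholder name
spec:
  # With 2 replicas, node drains may evict at most one pod at a time.
  minAvailable: 1
  selector:
    matchLabels:
      app: my-api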
How about (ab)using a ValidatingAdmissionWebhook to delay pod deletion? Here's a sketch of the idea:
- The ValidatingAdmissionWebhook intercepts pod deletion. It won't allow deletion of the pod if the pod is still reachable from the ALB (IP target-type ingress, initially).
- However, it patches the pod: it removes the labels and ownerReferences so the pod is removed from the ReplicaSet and the Endpoints object. The ELB also starts draining, since the pod is removed from the Endpoints.
- After some time passes and the ELB finishes its draining, the pod is deleted by aws-load-balancer-controller.
edit: I've implemented this idea as a chart here: https://github.com/foriequal0/pod-graceful-drain
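For anyone curious what the interception part of that idea looks like, here is a rough sketch of registering a validating webhook on pod DELETE. This is hand-written for illustration, not the actual pod-graceful-drain manifest; the names, namespace, and path are placeholders:

apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: pod-deletion-delay          # placeholder name
webhooks:
  - name: pod-deletion-delay.example.com   # placeholder webhook name
    admissionReviewVersions: ["v1"]
    sideEffects: NoneOnDryRun
    failurePolicy: Ignore           # don't block deletions if the webhook is down
    rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["DELETE"]
        resources: ["pods"]
    clientConfig:
      service:
        namespace: kube-system      # placeholder namespace
        name: pod-deletion-delay    # placeholder Service
        path: /validate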
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
This is still a serious issue; any update on it? We currently use the solution from @foriequal0, which has been doing a great job so far. I wish this were officially handled by the controller project itself.
/remove-lifecycle stale
What's the protocol for getting this prioritized? We've hit it as well. This is a serious issue and while I understand there's a workaround (hack), it's certainly reducing my confidence in running production workloads on this thing.
I'm also seeing this issue, but I think it's not necessarily an issue with the LB Controller? It seems draining for NLBs doesn't work as I would have expected. Instead of stopping new connections and letting existing connections continue, it continues to send new connections to the draining targets for a while.
From my testing, the actual delay for a target to be fully de-registered and drained seems to be around 2-3 minutes.
Adding this to each container exposed behind an NLB has worked for me so far:
lifecycle:
  preStop:
    exec:
      command: [ sh, -c, "sleep 180" ]
I would love to be able to get rid of this but it simply seems that the NLBs are extremely slow in performing management operations. I have even seen target registrations take almost 10 minutes.
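If your controller version supports setting target group attributes on the Service, the NLB drain window can also be tuned there rather than only sleeping; a sketch (annotation values are illustrative, double-check them against the controller docs for your version):

apiVersion: v1
kind: Service
metadata:
  name: my-api                      # placeholder name
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: external
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
    # Shorten the drain window and close lingering connections once it expires.
    service.beta.kubernetes.io/aws-load-balancer-target-group-attributes: deregistration_delay.timeout_seconds=120,deregistration_delay.connection_termination.enabled=true
spec:
  type: LoadBalancer
  selector:
    app: my-api
  ports:
    - port: 443
      targetPort: 8443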
I completely agree with what @ardove has said.
The point of this readinessGate feature is to delay the termination of the pod for as long as the LB needs it. If I have to update my chart to put a sleep in the preStop hook, then it means that this feature is not working. If I have to use the preStop hook, then I might as well not even use this readinessGate feature.
In my observation, the pod is allowed to terminate as soon as the new target becomes ready/healthy. I have seen that the old target was still draining after the pod terminated, and obviously that's going to result in 502 errors for those requests.
This feature almost works. Without the feature enabled I see 30 seconds to 1 minute of solid 502 errors. With the feature enabled I get brief sluggishness and maybe one or a handful of 502s. Hopefully you can get this fixed, because unfortunately close to good isn't good enough for something like this.
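As a sanity check for anyone in the same situation: with the v2 controller the gate is injected by a webhook once the namespace is labelled, so it is worth confirming both the label and the resulting pod condition are present. A sketch (namespace name is a placeholder):

apiVersion: v1
kind: Namespace
metadata:
  name: my-app                      # placeholder namespace
  labels:
    # Tells the controller's webhook to inject the readiness gate into new pods.
    elbv2.k8s.aws/pod-readiness-gate-inject: enabled

New pods in that namespace should then show a readinessGates entry whose conditionType starts with target-health.elbv2.k8s.aws/, and the pod only becomes Ready once the corresponding target is healthy in the target group.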
I thought it might be useful to share this KubeCon talk, "The Gotchas of Zero-Downtime Traffic /w Kubernetes", where the speaker goes into the strategies required for zero-downtime rolling updates with Kubernetes deployments (at least as of 2022):
https://www.youtube.com/watch?v=0o5C12kzEDI
It can be a bit hard to conceptualise the limitations of the async nature of Ingress/Endpoint objects and Pod termination, so I found the above talk (and live demo) helped a lot.
Hopefully it's useful for others.
@M00nF1sh I am implementing the same in my Kubernetes cluster but am unable to calculate the sleep time for the preStop hook and terminationGracePeriodSeconds. Currently terminationGracePeriodSeconds is 120 seconds and the deregistration delay is 300 seconds. Do we have any mechanism to calculate this?
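Based on the workaround described earlier in this thread (sleep at least the deregistration delay plus a few seconds, and make the grace period longer than the sleep plus the app's own shutdown time), a rough sketch with those numbers plugged in (purely illustrative, and only a fragment of a Deployment spec):

# preStop sleep                  >= deregistration delay (300s) + ~5s margin          -> ~305s
# terminationGracePeriodSeconds  >= preStop sleep + app shutdown time (say ~25s)      -> ~330s
# (Or lower the deregistration delay first so these numbers stay reasonable.)
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 330
      containers:
        - name: api                 # placeholder
          lifecycle:
            preStop:
              exec:
                command: ["sh", "-c", "sleep 305"]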
Does anyone have an update on this? After almost two years, I cannot see that it has been solved natively yet.
I wonder if finalizers would solve this problem nicely here :thinking:
For clusters using Traefik proxy as the ingress, it might also be worth looking into the entrypoint lifecycle feature to control graceful shutdowns: https://doc.traefik.io/traefik/routing/entrypoints/#lifecycle. At least in this case it avoids the need for the sleep workaround :-)
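For reference, the option in question is the entrypoint transport lifeCycle block in Traefik's static configuration; a sketch (entrypoint name and timings are illustrative):

entryPoints:
  websecure:
    address: ":443"
    transport:
      lifeCycle:
        # Keep accepting new requests for a while after SIGTERM, so the load
        # balancer can finish deregistering the target before Traefik stops serving.
        requestAcceptGraceTimeout: 30s
        # Then give in-flight requests this long to finish.
        graceTimeOut: 10s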
https://www.reddit.com/r/ProgrammerHumor/comments/1092kmf/just_add_sleep/j3vqiv2?utm_medium=android_app&utm_source=share&context=3
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
Would the EndpointSlice terminating condition solve this issue? It says "Consumers of the EndpointSlice API, such as Kube-proxy and Ingress Controllers, can now use these conditions to coordinate connection draining events, by continuing to forward traffic for existing connections but rerouting new connections to other non-terminating endpoints." But I'm not sure whether it would also work in this case.
https://kubernetes.io/blog/2022/12/30/advancements-in-kubernetes-traffic-engineering/
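For context, this is what the linked post refers to: a terminating endpoint is published on its EndpointSlice with conditions like the following, which a consumer (such as this controller) could in principle use to start deregistration before the pod is actually gone. Hand-written sketch, not real output; names and addresses are placeholders:

apiVersion: discovery.k8s.io/v1
kind: EndpointSlice
metadata:
  name: my-api-abc12                # placeholder; normally generated
  labels:
    kubernetes.io/service-name: my-api
addressType: IPv4
ports:
  - name: http
    protocol: TCP
    port: 8080
endpoints:
  - addresses: ["10.0.1.23"]
    conditions:
      ready: false                  # no longer ready for new traffic
      serving: true                 # still able to serve existing connections
      terminating: true             # the pod has a deletionTimestamp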
/remove-lifecycle rotten
Bumping this issue. Adding a sleep() does not sound professional; it's a workaround and only a workaround :/
I am experiencing this issue, too.
Any update? Does the pod readiness gate work with v2.6?