Pods stuck in Terminating when PVC is attached during graceful node evacuations
Platform I'm building on:
We are running on EKS 1.26, using Bottlerocket 1.14.x AMIs. Within our clusters we have both the AWS EBS and EFS CSI drivers, and these systems have been in use for multiple years. We recently tried upgrading the Bottlerocket AMIs from the 1.14.1 release to the 1.14.3 release. The initial upgrade and tests went fine: new pods worked, new nodes worked, etc. A few days after the rollout, though, we started to see problematic behavior when gracefully cordoning and evicting nodes so they could be shut down.
What I expected to happen:
We expected to see no differences between the old and new nodes. Instead, we found that many different services using PVCs (both EFS- and EBS-backed) had pods stuck in the Terminating state for 10 to 20 minutes or more. Eventually the pods were purged, but only when the underlying node that held them was finally deleted from the cluster.
We saw this behavior across three different clusters, across two different applications on those clusters, and with both EBS and EFS PVCs.
What actually happened:
We generally use Spot.io to manage our nodes. When the Ocean product terminates nodes, it first cordons a node and taints it. After that, it launches new compute capacity to handle the workloads that need to move. Finally when the compute capacity is launched and ready, Ocean will begin evicting the pods off the node-to-be-terminated. In our environment, we launch thousands of nodes a day, so we test this process regularly.
We saw three failures shortly after migrating to the 1.14.3 images:
Failure 1: EFS Backed PVC
In one of our development workloads, an engineer reached out to me for help because all of his pods were stuck in the Terminating state and were not making progress. After some troubleshooting, we manually deleted the pods with kubectl delete pod --force, which caused new pods to come up. We initially wrote this off as a strange one-off.
Failure 2: EBS backed PVC
A day after the first problem above, we had two failures on the same service across different clusters. We were again alerted to pods using EBS-backed PVCs that were stuck in the Terminating state. After much troubleshooting, we again solved the problem by forcefully deleting the pods with kubectl delete --force.
The Fix: Rolling back to 1.14.1
At this point I suspected there was something wrong with the AMI update, so we rolled it back. Since rolling back, we have not seen a single failure. We noted that the pods stuck in the Terminating state were all on nodes that were actively being drained for replacement.
Our suspicion was that https://github.com/bottlerocket-os/bottlerocket/issues/3230 was somehow biting us, but after reading through it, 1.14.2+ is supposed to contain the fix for that problem (which we weren't seeing on 1.14.1). So at this point we are unclear what the root cause is or why we're seeing this behavior.
How to reproduce the problem:
I haven't fully tested these instructions, but this is roughly how to replicate the environment in which we saw the issues:
- Launch an EKS 1.26 cluster
- Install the AWS EFS CSI Driver (helm chart 2.17.2)
- Install the AWS EBS CSI Driver (helm chart 2.4.9)
- Use Bottlerocket 1.14.3 AMIs for EKS 1.26
- Launch a workload that uses PVC volumes on some nodes
- Cordon and drain the workload off those nodes
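Since we test this at scale, the cordon/drain step can be scripted for repeated runs. This is a minimal sketch driving kubectl via subprocess; the node name, timeout, and drain flags below are assumptions for illustration, not our exact tooling:

```python
# Sketch of the cordon + drain repro step, driving kubectl via subprocess.
# The node name and timeout are hypothetical placeholders.
import subprocess

def drain_commands(node):
    """Build the kubectl invocations that mirror the manual repro steps."""
    return [
        ["kubectl", "cordon", node],
        ["kubectl", "drain", node,
         "--ignore-daemonsets", "--delete-emptydir-data", "--timeout=300s"],
    ]

def drain_node(node):
    """Cordon the node, then evict its pods; raises if kubectl fails."""
    for cmd in drain_commands(node):
        subprocess.run(cmd, check=True)

if __name__ == "__main__":
    # Print the commands instead of running them against a live cluster.
    for cmd in drain_commands("ip-10-0-0-1.ec2.internal"):
        print(" ".join(cmd))
```

Running something like `drain_node(...)` in a loop against nodes hosting the PVC-backed workload should surface any pods that linger in Terminating.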
I can take one of our test environments, re-run the upgrade, and trigger the failure again for log collection purposes, but I need to know which logs to collect.
Thanks for providing so much detail. I'll take a look at this.
Sorry for the latency here; I've now had more time to dig into this.
I'm attempting to set up a reproducer locally. So far terminations are succeeding, but I'm planning to run this in a loop until I detect an issue.
If you happen to set up a reproducer, I think running logdog on the Bottlerocket node would be helpful, as well as getting the log output from the pods associated with the EBS/EFS drivers.
I'll report back whether or not I can reliably reproduce in the meantime.
Thanks - next week I will try reproducing this again and run logdog. Is there a private place I can upload the logs to? Should I open an AWS support ticket to do so?
Yes, I think a support ticket may be best. Thanks!
We are facing the same issue: pods that have a PVC attached are getting stuck in the Terminating state. @cbgbt @diranged Were you able to find anything out? Can we prioritise this? We are using EKS 1.27 and Bottlerocket version k8s-1.27-x86_64-v1.14.2.
We are using EKS 1.27 and Bottlerocket version k8s-1.27-x86_64-v1.14.1 and the same issue is still happening. Is there any resolution?
@RahulGP14 are you able to reliably reproduce this? If so, can you capture the logdog output here suggested above? I think we're still having trouble reproducing this to understand what the failure is, and how to go about fixing it.
We have been seeing this behaviour too on Bottlerocket nodes for a while now. The pods don't specifically need a PVC attached; during Deployment rollouts it happens quite often that pods get stuck in Terminating for hours, even days.
K8s 1.27 with Bottlerocket 1.14 and 1.15 nodes.
I guess the same question for you @Guillermogsjc - can you capture the logdog output when this is happening. That should hopefully give some clues as to what is going on when there is this delay. Thanks!
Possibly related to https://github.com/kubernetes/kubernetes/issues/118261 which first appeared in 1.26 - if pods have duplicate fields (ports or environment variables), then they can't be updated via server-side apply and get stuck in Terminating.
Might be worth double-checking your affected pods to see if this applies. There's a fix now that's being backported to 1.26, 1.27, 1.28.
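If it helps anyone double-check, here is a rough sketch of that duplicate-field condition applied to a pod spec dict (e.g. as dumped by `kubectl get pod -o json`); the function name and sample spec are mine for illustration, not from the linked issue:

```python
# Detect the kubernetes#118261 trigger: duplicate containerPort entries or
# duplicate env var names in a container, which break server-side apply.
def find_duplicate_fields(pod_spec):
    """Return a list of human-readable problems found in a pod spec dict."""
    problems = []
    for c in pod_spec.get("containers", []):
        # Ports are keyed by (containerPort, protocol); protocol defaults to TCP.
        ports = [(p.get("containerPort"), p.get("protocol", "TCP"))
                 for p in c.get("ports", [])]
        if len(ports) != len(set(ports)):
            problems.append("container %s: duplicate ports" % c["name"])
        env_names = [e["name"] for e in c.get("env", [])]
        if len(env_names) != len(set(env_names)):
            problems.append("container %s: duplicate env vars" % c["name"])
    return problems

# Hypothetical spec exhibiting both kinds of duplicates:
spec = {"containers": [{
    "name": "app",
    "ports": [{"containerPort": 8080}, {"containerPort": 8080}],
    "env": [{"name": "FOO", "value": "1"}, {"name": "FOO", "value": "2"}],
}]}
print(find_duplicate_fields(spec))
```

A non-empty result on an affected pod would point toward the server-side apply bug rather than a Bottlerocket-specific problem.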
@bcressey can we get an updated BottleRocket release with these K8S versions: https://github.com/kubernetes/kubernetes/issues/118261#issuecomment-1790657622?
@diranged I'll check on that - #3612 has some kubelet updates, but not yet 1.28.4 / 1.27.8 / 1.26.11 with that fix.
#3612 now has those versions, so I expect they'll go out in the next release.
Sorry @bcressey for not providing logs. We updated our nodegroups with the latest eksctl, which incorporates the new Bottlerocket versions, and we have not observed any more hang cases, at least in our workloads. So I guess the fix you mentioned solved this hang.
Thanks!