
[EKS] [request]: Insufficient information when nodegroup upgrade fails

Open dza89 opened this issue 4 years ago • 16 comments

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Tell us about your request: Currently the information returned by EKS when a nodegroup upgrade fails is insufficient to act upon.

Which service(s) is this request for? EKS

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?

The message below is from CloudFormation, but the EKS API returns the same information.

"ResourceStatusReason": "Update failed due to [{ErrorCode: PodEvictionFailure, ErrorMessage: Reached max retries while trying to evict pods from nodes in node group prod-nodegroup, ResourceIds: [ip-10-233-150-169.eu-west-1.compute.internal]}]"

It would be handy to know which pod it's failing on.

Are you currently working around this issue?

Additional context

Attachments

dza89 avatar Feb 18 '21 12:02 dza89

Thanks for the feedback, we'll include the pod name that failed to be evicted

mikestef9 avatar Feb 18 '21 21:02 mikestef9

@mikestef9 I'm sorry to do it this way, but this is really becoming an issue. My production upgrades fail inconsistently, and I have no idea why. I can reproduce the issue in non-production, but because it's inconsistent and I can't access any logs, I'm a bit out of options. If I do a manual drain of each node, it's fine. I've contacted AWS support, but when they took a look everything went fine. Can this please be given more priority?

dza89 avatar Mar 30 '21 14:03 dza89

If control plane logging is enabled for the EKS cluster, you can use CloudWatch Logs Insights with the following filter to find out which pods failed eviction.

  1. Go to the CloudWatch Logs console, then click on Insights.
  2. Select the log group that corresponds to the cluster's control plane logs.
  3. Use the filter below to list all eviction failures that happened in the EKS cluster:

     fields @timestamp, @message
     | filter @logStream like "kube-apiserver-audit"
     | filter ispresent(requestURI)
     | filter objectRef.subresource = "eviction"
     | filter responseObject.status = "Failure"

  4. After running the query, you will see all the eviction failures, including repeated retries performed by the client.
  5. The "requestURI" field inside each result message points to the pod whose eviction failed.

Further, the duplicate eviction retries can be collapsed into one row per pod by aggregating the results with a second query; the original comment showed that query only in a screenshot.
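A de-duplicating query along those lines might look like this. It is a sketch assuming the same audit-log fields used in the steps above, not the query from the original screenshot:

```
fields @timestamp, requestURI
| filter @logStream like "kube-apiserver-audit"
| filter objectRef.subresource = "eviction"
| filter responseObject.status = "Failure"
| stats count(*) as eviction_failures by requestURI
```

Sorting on `eviction_failures` then surfaces the pods that were retried the most, which are usually the ones blocking the upgrade.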

TheRealGoku avatar Apr 01 '21 19:04 TheRealGoku

@TheRealGoku

Thanks, but that surfaces regular pod eviction failures, which happen all the time during an update because of PDBs. So it doesn't really pinpoint the problem for me.

dza89 avatar Apr 02 '21 09:04 dza89

Any update on this? I am also facing the same problem. I see it more frequently when the Calico network policy engine is running alongside the vpc-cni plugin.

The issue seems to be related to the PDB (Pod Disruption Budget), because of which pods are not getting evicted from the node. Is there any option to forcefully evict the pods? I tried ForcedUpdateEnabled, but it doesn't help.

@dza89, are you also seeing the problem more frequently when running Calico? Is there any workaround apart from manually draining the nodes, as that might not be feasible in a production setup?
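When an upgrade stalls on evictions, a useful first check is which PDBs currently allow zero disruptions. A minimal sketch that filters the output of `kubectl get pdb -A -o json` (the sample data below is illustrative, not from a real cluster):

```python
import json

def blocking_pdbs(pdb_list_json):
    """Return (namespace, name) for every PDB whose status reports
    zero allowed disruptions, i.e. every PDB that blocks eviction."""
    items = json.loads(pdb_list_json)["items"]
    return [
        (i["metadata"]["namespace"], i["metadata"]["name"])
        for i in items
        if i["status"]["disruptionsAllowed"] == 0
    ]

# Illustrative sample shaped like `kubectl get pdb -A -o json` output
sample = json.dumps({"items": [
    {"metadata": {"namespace": "prod", "name": "web-pdb"},
     "status": {"disruptionsAllowed": 0}},
    {"metadata": {"namespace": "prod", "name": "api-pdb"},
     "status": {"disruptionsAllowed": 1}},
]})

print(blocking_pdbs(sample))  # → [('prod', 'web-pdb')]
```

Any PDB this reports is a candidate for why `PodEvictionFailure` keeps recurring on that node group.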

amitkatyal avatar Aug 05 '21 00:08 amitkatyal

I've been experiencing this issue on our larger clusters (~50 nodes). I use Terraform to update my cluster, and this is the error. I was able to drain the node manually, but newly created nodes still use the previous version of the cluster, so my cluster is now a mix of v1.21 and v1.20 nodes. I used the AWS CLI to update the nodes, specifying the release version and using the force flag, but this might cause downtime for some pods. Another option is to create a new node group, transfer all the pods to it, and delete the old one.

[screenshot of the Terraform error]

A side note, I was able to update one of our clusters with 5 nodes with no issues using the same terraform config.

juliussediolano avatar Oct 14 '21 00:10 juliussediolano

The issue seems to be related to the PDB (Pod Disruption Budget), because of which pods are not getting evicted from the node. Is there any option to forcefully evict the pods? I tried ForcedUpdateEnabled, but it doesn't help.

aws eks update-nodegroup-version \
  --cluster-name <your_cluster_name> \
  --nodegroup-name <your_nodegroup_name> \
  --kubernetes-version 1.20 \
  --release-version 1.20.7-20211004 \
  --launch-template version=<your_launch_template_version>,id=<your_launch_template_id> \
  --force

juliussediolano avatar Oct 14 '21 05:10 juliussediolano

I am seeing very inconsistent behavior when updating nodegroups. Even after deleting all the PodDisruptionBudget objects from the cluster and using the "force" update option, I am still seeing a one-hour update time on a two-node nodegroup, which sometimes ends in an error.

jjhidalgar avatar Nov 08 '21 15:11 jjhidalgar

It's been 1 year since this request was filed. Any update please?

tigerinus avatar Feb 25 '22 17:02 tigerinus

One of the reasons why --force does not work with a PDB (Pod Disruption Budget) is this issue: a pod in CrashLoopBackOff is considered to be in the Running phase, not the Failed phase, so unready pods are not evicted, because evicting them would still charge a disruption against the PDB.
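The arithmetic behind that can be sketched as follows. This is illustrative PDB accounting (healthy pods minus minAvailable), not the controller's actual code:

```python
def disruptions_allowed(pods, min_available):
    """Only ready ("healthy") pods count toward the budget, but evicting
    an unready pod still consumes a disruption -- so a CrashLoopBackOff
    pod can pin the allowed disruptions at zero."""
    healthy = sum(1 for p in pods if p["ready"])
    return max(0, healthy - min_available)

pods = [
    {"name": "app-1", "phase": "Running", "ready": True},
    {"name": "app-2", "phase": "Running", "ready": True},
    # CrashLoopBackOff: the phase is still "Running", but the pod is unready
    {"name": "app-3", "phase": "Running", "ready": False},
]

print(disruptions_allowed(pods, min_available=2))  # → 0: all evictions blocked
```

With the crashing pod healthy again the budget would allow one disruption, which is why upgrades succeed once the workload is fixed or the pod is force-deleted.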

utk231 avatar Mar 09 '22 18:03 utk231

My workaround was to bring the "desired size" down to zero. Then manually terminate the nodes that refused to quit. It's likely these were the nodes that were preventing the update. On occasion I'll notice stubborn Istio pods.

tvkit avatar Jun 12 '22 20:06 tvkit

Another half of a year passed... Still facing the same issue.

setevoy2 avatar Oct 28 '22 12:10 setevoy2

We also faced the same issue while upgrading nodes from v1.23 to v1.24.13. We do have PDBs, but they belong to different node groups, and those node groups upgraded successfully. Can we find out the reason and how to fix it? We don't want to run the upgrade with the force option.

jainpratik163 avatar Jun 28 '23 12:06 jainpratik163

Here we noticed the cause was an istiod pod that was impossible to evict, so after we forced the pod deletion the process continued as expected.

alexlopes avatar Mar 06 '24 18:03 alexlopes

Thanks for the feedback, we'll include the pod name that failed to be evicted

Was wondering if this is likely to be implemented?

TheBlackMini avatar May 01 '24 02:05 TheBlackMini

How about making the maximum number of retries configurable?

PDBs are useful for making updates seamless, but sometimes it just takes a (very, very) long time to take down the old pod and spin up a new one. It would be immensely useful if that "MaxNumberOfRetries" parameter were exposed somehow.

Force update is not OK because it can induce downtime.
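What such a knob would amount to, sketched as a generic retry loop. The names (`evict_with_retries`, `max_retries`) are hypothetical and not an EKS or Kubernetes API:

```python
def evict_with_retries(evict, pod, max_retries=5):
    """Call `evict(pod)` until it succeeds or the retry budget is spent.
    `evict` returns True on success, False when blocked (e.g. by a PDB)."""
    for attempt in range(1, max_retries + 1):
        if evict(pod):
            return attempt
    raise RuntimeError(
        f"PodEvictionFailure: gave up on {pod} after {max_retries} retries")

# Fake evictor for illustration: blocked twice (as by a tight PDB), then succeeds.
calls = {"n": 0}
def fake_evict(pod):
    calls["n"] += 1
    return calls["n"] >= 3

print(evict_with_retries(fake_evict, "web-0"))  # → 3: succeeded on the third try
```

A configurable `max_retries` (or an overall deadline) would let slow-but-healthy rollovers finish without resorting to --force and its potential downtime.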

gillesdouaire avatar May 14 '24 18:05 gillesdouaire

Thanks for the feedback, we'll include the pod name that failed to be evicted

Is there any ETA for that? It's been more than 3 years(!) 🤓

soupdiver avatar Jul 10 '24 12:07 soupdiver