aws-node-termination-handler
Global timeout reached
Describe the bug
Lately, we have been seeing repeated errors across all of our environments:
There was a problem while trying to cordon and drain the node, [error when waiting for pod
The log indicates that NTH hit the timeout on all pods that were running on the spot instance. From what I understand, one pod took too long to terminate gracefully, which led to an ungraceful shutdown of the node and of the pods that had not been evicted yet.
The issue is that I can't tell which pod failed to terminate gracefully, because the log reports a timeout for every pod. NTH also logged that it evicted all pods at the same time, so every pod received the evict command. Eventually, kubelet on those nodes gets a SIGKILL because of the spot instance's 2-minute interruption window, and that causes pod-scheduling issues.
I would like assistance in understanding this behavior and in finding the pods that are causing it.
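To narrow down candidates ourselves, we tried something along the lines of the sketch below (assumes client-go with a local kubeconfig; the node name is a placeholder), which lists each pod on the interrupted node with its terminationGracePeriodSeconds to spot likely slow terminators:

```go
// Sketch only: list pods on one node with their termination grace periods.
// "ip-10-0-0-1.ec2.internal" is a placeholder node name.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// Select all pods scheduled on the interrupted node, across namespaces.
	pods, err := clientset.CoreV1().Pods("").List(context.TODO(), metav1.ListOptions{
		FieldSelector: "spec.nodeName=ip-10-0-0-1.ec2.internal",
	})
	if err != nil {
		panic(err)
	}

	for _, p := range pods.Items {
		grace := int64(30) // Kubernetes default when unset
		if p.Spec.TerminationGracePeriodSeconds != nil {
			grace = *p.Spec.TerminationGracePeriodSeconds
		}
		fmt.Printf("%s/%s grace=%ds\n", p.Namespace, p.Name, grace)
	}
}
```

This only shows the configured grace periods, though, not which pod actually overran its eviction during the drain.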
Steps to reproduce
Send some spot interruptions; the issue happens from time to time.
Expected outcome
All spot nodes are drained and terminated in less than 2 minutes.
Application Logs
There was a problem while trying to cordon and drain the node, [error when waiting for pod
Environment Production
- NTH App Version: v1.20
- NTH Mode (IMDS/Queue processor): Queue processor
- OS/Arch: Linux Ubuntu 20.04.6 LTS arm64
- Kubernetes version: 1.25.15
- Installation method: Helm chart
Hi @doryer. I've been trying to recreate your issue, but the logged error message I've been receiving from NTH includes the name(s) of the pods that were not gracefully terminated due to reaching the global timeout.
Can you share your NTH configuration? Also, can you share the log statements where NTH first indicates it's evicting the pods? I'd like to see if the pod names are always missing.
My logs do contain the pod names, but they list every pod that was running on that node, including the pods that were evicted successfully, which makes it difficult to tell which specific pod caused the global timeout error. Maybe I wasn't clear: the eviction log lines do include the names of the pods running on that node; I just redacted them.
Regarding the NTH configuration, which values should I share?
@doryer upon the draining of a node, each pod's eviction runs independently and in parallel. This means that one pod failing to evict before the global timeout does not affect the eviction of another pod. The process of draining is further explained here and illustrated in the flow diagram.
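For anyone reading along, here is a minimal sketch (not NTH's actual implementation) of the behavior described above: every pod eviction runs in its own goroutine, and all of them share one global timeout, so a slow pod only fails its own eviction.

```go
// Sketch of parallel pod evictions bounded by a shared global timeout.
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// evictPod simulates waiting for a pod to finish terminating, bounded by the
// shared global-timeout context.
func evictPod(ctx context.Context, pod string, termination time.Duration) error {
	select {
	case <-time.After(termination): // pod finished terminating in time
		return nil
	case <-ctx.Done(): // global timeout hit first
		return fmt.Errorf("error when waiting for pod %q: %w", pod, ctx.Err())
	}
}

func main() {
	// Stands in for the ~2-minute spot interruption window; shortened so the
	// sketch finishes quickly.
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()

	// Simulated per-pod termination times; only "slow-pod" exceeds the window.
	pods := map[string]time.Duration{
		"fast-pod": 1 * time.Second,
		"slow-pod": 10 * time.Second,
	}

	var wg sync.WaitGroup
	for name, d := range pods {
		wg.Add(1)
		go func(name string, d time.Duration) {
			defer wg.Done()
			if err := evictPod(ctx, name, d); err != nil {
				fmt.Println(err) // only the slow pod reports a timeout error
			}
		}(name, d)
	}
	wg.Wait()
}
```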
I understand the drain flow, but I would expect the logs or metrics to show which pods failed to be evicted within the global timeout (2 minutes). Because we run lots of pods per node, each log line contains the same error for ~30 pods, and it is hard to see which pod is causing the global timeout. A metric like nth_failed_timeout_pods{pod_name="xxx"} would let us identify and fix the services that take too long to terminate.
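For illustration, a hypothetical sketch of what such a metric could look like with prometheus/client_golang (the metric name and labels follow my suggestion above and are not part of NTH today):

```go
// Hypothetical metric sketch, not existing NTH code.
package metrics

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// nthFailedTimeoutPods counts pods whose eviction did not finish before the
// global drain timeout. The "namespace" and "pod_name" labels are assumptions.
var nthFailedTimeoutPods = promauto.NewCounterVec(
	prometheus.CounterOpts{
		Name: "nth_failed_timeout_pods",
		Help: "Pods that failed to evict before the global drain timeout.",
	},
	[]string{"namespace", "pod_name"},
)

// RecordEvictionTimeout would be called by the drain logic whenever a pod's
// eviction errors out with a timeout.
func RecordEvictionTimeout(namespace, podName string) {
	nthFailedTimeoutPods.WithLabelValues(namespace, podName).Inc()
}
```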
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If you want this issue to never become stale, please ask a maintainer to apply the "stalebot-ignore" label.
This issue was closed because it has become stale with no activity.