aws-node-termination-handler
Global timeout reached
Describe the bug
Lately, we have been seeing repeated errors across all of our environments:
There was a problem while trying to cordon and drain the node, [error when waiting for pod
The log indicates that NTH hit the timeout on all pods that were running on the spot instance. From what I understand, one pod took too long to terminate gracefully, which led to an ungraceful shutdown of the node and of the pods that had not been evicted yet.
The issue is that I can't tell which pod failed to terminate gracefully, because the log reports a timeout for every pod. NTH also logged that it evicted all pods at the same time, so every pod received the evict command. Eventually, kubelet on those nodes gets a SIGKILL because of the spot instance's 2-minute interruption window, and that causes pod-scheduling issues.
I would like assistance in understanding this behavior and in finding the pods that are causing it.
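To narrow down candidates ourselves, we tried something along the lines of the sketch below (assumes client-go with a local kubeconfig; the node name is a placeholder), which lists each pod on the interrupted node with its terminationGracePeriodSeconds to spot likely slow terminators:

```go
// Sketch only: list pods on one node with their termination grace periods.
// "ip-10-0-0-1.ec2.internal" is a placeholder node name.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// Select all pods scheduled on the interrupted node, across namespaces.
	pods, err := clientset.CoreV1().Pods("").List(context.TODO(), metav1.ListOptions{
		FieldSelector: "spec.nodeName=ip-10-0-0-1.ec2.internal",
	})
	if err != nil {
		panic(err)
	}

	for _, p := range pods.Items {
		grace := int64(30) // Kubernetes default when unset
		if p.Spec.TerminationGracePeriodSeconds != nil {
			grace = *p.Spec.TerminationGracePeriodSeconds
		}
		fmt.Printf("%s/%s grace=%ds\n", p.Namespace, p.Name, grace)
	}
}
```

This only shows the configured grace periods, though, not which pod actually overran its eviction during the drain.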
Steps to reproduce
Send some spot interruptions; the issue happens from time to time.
Expected outcome
All spot nodes are drained and terminated in less than 2 minutes.
Application Logs
There was a problem while trying to cordon and drain the node, [error when waiting for pod
Environment Production
- NTH App Version: v1.20
- NTH Mode (IMDS/Queue processor): Queue processor
- OS/Arch: Linux Ubuntu 20.04.6 LTS arm64
- Kubernetes version: 1.25.15
- Installation method: Helm chart
Hi @doryer. I've been trying to recreate your issue, but the logged error message I've been receiving from NTH includes the name(s) of the pods that were not gracefully terminated due to reaching the global timeout.
Can you share your NTH configuration? Also, can you share the log statements where NTH first indicates it's evicting the pods? I'd like to see if the pod names are always missing.
My logs do contain the pod names, but they list every pod that was running on that node, including the pods that were evicted successfully, which makes it difficult to tell which specific pod caused the global timeout error. Maybe I wasn't clear: the eviction log lines do include the names of the pods running on that node; I just redacted them.
Regarding the NTH configuration, which values should I share?
@doryer upon the draining of a node, each pod's eviction runs independently and in parallel. This means that one pod failing to evict before the global timeout does not affect the eviction of another pod. The process of draining is further explained here and illustrated in the flow diagram.
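For anyone reading along, here is a minimal sketch (not NTH's actual implementation) of the behavior described above: every pod eviction runs in its own goroutine, and all of them share one global timeout, so a slow pod only fails its own eviction.

```go
// Sketch of parallel pod evictions bounded by a shared global timeout.
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// evictPod simulates waiting for a pod to finish terminating, bounded by the
// shared global-timeout context.
func evictPod(ctx context.Context, pod string, termination time.Duration) error {
	select {
	case <-time.After(termination): // pod finished terminating in time
		return nil
	case <-ctx.Done(): // global timeout hit first
		return fmt.Errorf("error when waiting for pod %q: %w", pod, ctx.Err())
	}
}

func main() {
	// Stands in for the ~2-minute spot interruption window; shortened so the
	// sketch finishes quickly.
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()

	// Simulated per-pod termination times; only "slow-pod" exceeds the window.
	pods := map[string]time.Duration{
		"fast-pod": 1 * time.Second,
		"slow-pod": 10 * time.Second,
	}

	var wg sync.WaitGroup
	for name, d := range pods {
		wg.Add(1)
		go func(name string, d time.Duration) {
			defer wg.Done()
			if err := evictPod(ctx, name, d); err != nil {
				fmt.Println(err) // only the slow pod reports a timeout error
			}
		}(name, d)
	}
	wg.Wait()
}
```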
I understand the drain flow, but I would expect the logs or metrics to show which pods failed to be evicted within the global timeout (2 minutes). Because we run lots of pods per node, each log line contains the same error for ~30 pods, and it is hard to see which pod is causing the global timeout. A metric like nth_failed_timeout_pods{pod_name="xxx"} would let us identify and fix the services that take too long to terminate.
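For illustration, a hypothetical sketch of what such a metric could look like with prometheus/client_golang (the metric name and labels follow my suggestion above and are not part of NTH today):

```go
// Hypothetical metric sketch, not existing NTH code.
package metrics

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// nthFailedTimeoutPods counts pods whose eviction did not finish before the
// global drain timeout. The "namespace" and "pod_name" labels are assumptions.
var nthFailedTimeoutPods = promauto.NewCounterVec(
	prometheus.CounterOpts{
		Name: "nth_failed_timeout_pods",
		Help: "Pods that failed to evict before the global drain timeout.",
	},
	[]string{"namespace", "pod_name"},
)

// RecordEvictionTimeout would be called by the drain logic whenever a pod's
// eviction errors out with a timeout.
func RecordEvictionTimeout(namespace, podName string) {
	nthFailedTimeoutPods.WithLabelValues(namespace, podName).Inc()
}
```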
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If you want this issue to never become stale, please ask a maintainer to apply the "stalebot-ignore" label.
This issue was closed because it has become stale with no activity.