k8s-node-termination-handler
[Question] Why is node-termination-handler sometimes unable to delete all the pods?
Hi,
I have preemptible nodes with more than 40 pods each. For some reason the handler is not able to delete all of them: it starts, and after deleting around 20 pods it stops, with no further log output. I also tried deleting the pods myself while the pod listing in eviction.go:66 was taking place, but with no success either.
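One guess (not verified against the handler's actual source) is that evicting pods one by one cannot finish inside the ~30-second preemption window. A minimal sketch of draining concurrently under a hard deadline; the `evict` function is a hypothetical stand-in for the real eviction call, injected so the logic runs without a cluster:

```python
import concurrent.futures

def evict_all(pod_names, evict, deadline_seconds=30, workers=10):
    """Evict pods concurrently, giving up when the deadline expires.

    `evict(name)` is a stand-in for the real eviction call (e.g. the
    Eviction subresource via a Kubernetes client); it is injected here
    so the draining logic can be exercised without a cluster.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=workers)
    futures = {pool.submit(evict, name): name for name in pod_names}
    evicted = []
    try:
        for done in concurrent.futures.as_completed(
                futures, timeout=deadline_seconds):
            if done.exception() is None:
                evicted.append(futures[done])
    except concurrent.futures.TimeoutError:
        pass  # preemption deadline hit; stop waiting for stragglers
    finally:
        # Don't block on evictions that are still in flight.
        pool.shutdown(wait=False, cancel_futures=True)
    return evicted
```

With 40 pods and a serial loop, any per-pod latency multiplies by 40; a worker pool keeps the total closer to the slowest single eviction.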
Thanks for your help
Hi, I have the same issue.
I tried to test it and watch the output logs, but no luck at all. It deletes just the first 6 pods, in the order they are listed, and logs nothing after that.
Thanks for any ideas.
I did some testing. It looks like it does the job, but only if there are fewer than 11 pods on a node. In that case it removes all of them; otherwise it gets stuck, processes just a few of the pods, and ends suddenly with no further logs. The remaining pods keep running until the node's hardware shuts down, so it takes Kubernetes a long time to notice them and reschedule.
Hi, facing the same issue. I see from the Google docs that a preempted node gets 30 seconds before it is deleted. The instance's `preempted` metadata value is set to TRUE as soon as the instance is marked for preemption, but there may be some delay between the ACPI G2 soft-off signal and the metadata query returning TRUE. In essence, once the `preempted` value is set to TRUE, the instance is preempted within 30 seconds. But when I run node-termination-handler, I don't think it is capturing the right signal, because it doesn't seem to get the full 30 seconds to delete all the pods on the node: it deletes only some of them and then exits without any further log.
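For reference, the `preempted` value described above can be read from the Compute Engine metadata server, and the documented `?wait_for_change=true` form blocks until the value flips. A small sketch with the HTTP call injected so the loop is testable off-GCE; the URL and the required `Metadata-Flavor: Google` header come from the Compute Engine metadata docs, everything else is illustrative:

```python
# Hanging-GET endpoint documented by Compute Engine: the response body
# becomes "TRUE" once the instance has been marked for preemption.
PREEMPTED_URL = (
    "http://metadata.google.internal/computeMetadata/v1"
    "/instance/preempted?wait_for_change=true"
)
METADATA_HEADERS = {"Metadata-Flavor": "Google"}

def wait_for_preemption(fetch):
    """Block until the metadata server reports preemption.

    `fetch(url, headers)` is injected; in real use it would be an HTTP
    GET (e.g. via urllib.request) against the metadata server.
    """
    while True:
        value = fetch(PREEMPTED_URL, METADATA_HEADERS).strip()
        if value == "TRUE":
            return True
```

Whatever signal the handler actually listens for, this endpoint is the authoritative one for "the 30-second countdown has started".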
I followed the GCP article
https://cloud.google.com/solutions/running-web-applications-on-gke-using-cost-optimized-pvms-and-traffic-director#post-preemption_validations
and applied its recommendations, including the DaemonSet that creates a systemd service blocking the shutdown of the kubelet process.
I also delegated the pod deletion to an external service, running in another pod in another namespace, so that the deletion of all pods is always executed outside the machine being deleted/preempted.
But still no success.
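The delegation described above could look roughly like this: a controller running elsewhere that, given the preempted node's name, lists that node's pods and deletes them. The two client calls are injected stubs here; with the official Kubernetes Python client they would plausibly map to `list_pod_for_all_namespaces` with a `spec.nodeName` field selector and `delete_namespaced_pod`, but treat those mappings as an assumption:

```python
def drain_node(node_name, list_pods, delete_pod):
    """Delete every pod on `node_name` from outside the node.

    `list_pods(field_selector)` yields (name, namespace) pairs and
    `delete_pod(name, namespace)` performs one deletion; both are
    injected so the routine can be tested without a cluster.
    """
    deleted = []
    for name, namespace in list_pods(f"spec.nodeName={node_name}"):
        delete_pod(name, namespace)
        deleted.append((name, namespace))
    return deleted
```

Running this off-node sidesteps the problem of the handler dying mid-drain when its own node is shut down.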
I am watching these events from Kubernetes when node-termination-handler tries to delete the pods:
TaintManagerEviction | Cancelling deletion of Pod yyy/xx
Do you know what this means?