k8s-node-termination-handler

[Question] Why is the node-termination handler sometimes unable to delete all the pods?

Open santinoncs opened this issue 3 years ago • 5 comments

Hi,

I have preemptible nodes with more than 40 pods each. For some reason the handler is not able to delete all of them: it starts, and after deleting around 20 pods it stops, with no further logs from that moment on. I also tried deleting the pods myself at the same time that the pod listing in

eviction.go:66

is taking place, but with no success either.
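
For what it's worth, here is a minimal client-go sketch (not the handler's own code; the node name is only an example) of listing which pods are still scheduled on the node after the handler stops:

```go
// Sketch only: list the pods still scheduled on a given node.
// Assumes the program runs in-cluster; the node name is illustrative.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	nodeName := "gke-pool-preemptible-xyz" // illustrative
	pods, err := client.CoreV1().Pods("").List(context.TODO(), metav1.ListOptions{
		FieldSelector: "spec.nodeName=" + nodeName,
	})
	if err != nil {
		panic(err)
	}
	for _, p := range pods.Items {
		fmt.Printf("%s/%s\t%s\n", p.Namespace, p.Name, p.Status.Phase)
	}
}
```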

Thanks for your help

santinoncs, Dec 02 '20 14:12

Hi, I have the same issue.

I tried to test it and watched the output logs, but no luck at all. It deletes just 6 pods, in the order they are listed, and there are no further logs after that.

Thanks for any ideas

toms049, Dec 17 '20 13:12

I did some testing, and it looks like it does the job, but only if there are fewer than 11 pods on the node. In that case it removes all of them; otherwise it gets stuck, processes just a few of the pods and ends suddenly, with no further logs. The rest of the pods keep running until the hardware shutdown of the node, so it takes a long time for Kubernetes to notice them and reschedule.

toms049, Dec 18 '20 21:12

Hi, I'm facing the same issue.

I see from the Google docs that a preempted node gets 30 seconds before it is deleted. The instance metadata value is set to TRUE as soon as the instance is marked for preemption, but there may be some delay between the ACPI G2 signal and the metadata query actually returning TRUE. In essence, once the preempted value is set to TRUE, the instance is preempted within 30 seconds.

But when I run the node-termination-handler, I don't think it is capturing the right signal, because the handler doesn't seem to get the full 30 seconds to delete all the pods present on the node. It was able to delete only some of the pods and then exited without any further log.
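
As a sanity check on the timing, the metadata server's preempted key can be long-polled directly. This is a minimal sketch (error handling simplified) that prints the moment the value flips to TRUE:

```go
// Sketch only: long-poll the GCE metadata server until the instance is
// marked as preempted. The metadata key, the wait_for_change parameter
// and the Metadata-Flavor header are documented GCE behaviour.
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

func main() {
	url := "http://metadata.google.internal/computeMetadata/v1/instance/preempted?wait_for_change=true"
	for {
		req, _ := http.NewRequest("GET", url, nil)
		req.Header.Set("Metadata-Flavor", "Google")
		resp, err := http.DefaultClient.Do(req)
		if err != nil {
			time.Sleep(time.Second)
			continue
		}
		body, _ := io.ReadAll(resp.Body)
		resp.Body.Close()
		if string(body) == "TRUE" {
			fmt.Println("preemption signalled at", time.Now().Format(time.RFC3339))
			return
		}
	}
}
```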

laxmiprasanna-gunna, Jan 21 '21 10:01

I followed the GCP article

https://cloud.google.com/solutions/running-web-applications-on-gke-using-cost-optimized-pvms-and-traffic-director#post-preemption_validations

and applied its recommendations, including the DaemonSet that creates a systemd service to block the shutdown of the kubelet process.

I also delegated the deletion of all pods to an external service running in a pod in another namespace, so the eviction is always executed from outside the machine that is being deleted/preempted.

But still with no success.
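
For context, this is roughly the shape of such an external eviction service, sketched with client-go (node name, grace period and error handling are illustrative; it uses the Eviction API, which needs a recent client-go, so PodDisruptionBudgets are still respected):

```go
// Sketch only: evict every pod on a preempted node from outside that node.
package main

import (
	"context"
	"fmt"

	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func evictNodePods(client kubernetes.Interface, nodeName string) error {
	// List every pod scheduled on the preempted node, across all namespaces.
	pods, err := client.CoreV1().Pods("").List(context.TODO(), metav1.ListOptions{
		FieldSelector: "spec.nodeName=" + nodeName,
	})
	if err != nil {
		return err
	}
	grace := int64(15) // illustrative grace period, kept under the 30s preemption window
	for _, p := range pods.Items {
		err := client.CoreV1().Pods(p.Namespace).EvictV1(context.TODO(), &policyv1.Eviction{
			ObjectMeta:    metav1.ObjectMeta{Name: p.Name, Namespace: p.Namespace},
			DeleteOptions: &metav1.DeleteOptions{GracePeriodSeconds: &grace},
		})
		if err != nil {
			fmt.Printf("eviction of %s/%s failed: %v\n", p.Namespace, p.Name, err)
		}
	}
	return nil
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	if err := evictNodePods(client, "gke-pool-preemptible-xyz"); err != nil { // node name illustrative
		panic(err)
	}
}
```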

santinoncs, Jan 25 '21 11:01

I am watching these events from Kubernetes when the node-termination handler tries to delete the pods:

TaintManagerEviction | Cancelling deletion of Pod yyy/xx

Do you know what this means?

santinoncs, Apr 23 '21 10:04