k8s-node-termination-handler

Node Termination handler may still be necessary

Open chrisroat opened this issue 2 years ago • 3 comments

The current README states this handler is deprecated in favor of the new Graceful Node Shutdown:

⚠️ Deprecation Notice As of Kubernetes 1.20, Graceful Node Shutdown replaces the need for GCP Node termination handler. GKE on versions 1.20+ enables Graceful Node Shutdown by default. Refer to the GKE documentation and Kubernetes documentation for more info about Graceful Node Shutdown (docs, blog post).

I have been using the Node Termination handler with GKE < 1.20, on preemptible VMs with GPUs. The handler was needed to avoid a race condition on node restart that sometimes caused pods not to correctly recognize the GPU.
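For context, the handler's core loop is essentially a long-poll on the GCE metadata server for a termination notice, followed by eviction of the node's workloads so GPU pods restart cleanly elsewhere. The sketch below is an illustrative Python approximation of that flow, not the handler's actual Go implementation; the `instance/preempted` metadata key and `wait_for_change` parameter are documented GCE behavior, while the `drain_node` helper, the use of `kubectl`, and the `NODE_NAME` environment variable are assumptions for the example.

```python
# Illustrative sketch only: the real k8s-node-termination-handler is written in Go.
# Assumed setup: the pod runs with credentials allowed to `kubectl drain`, and
# NODE_NAME is injected (e.g. via the downward API).
import os
import subprocess

import requests

METADATA_URL = "http://metadata.google.internal/computeMetadata/v1/instance/preempted"
HEADERS = {"Metadata-Flavor": "Google"}


def wait_for_preemption() -> None:
    """Long-poll the metadata server until the instance is marked preempted."""
    while True:
        resp = requests.get(
            METADATA_URL,
            headers=HEADERS,
            params={"wait_for_change": "true"},
            timeout=3600,
        )
        if resp.text.strip() == "TRUE":
            return


def drain_node(node_name: str) -> None:
    """Hypothetical helper: evict workloads before the node disappears."""
    subprocess.run(
        ["kubectl", "drain", node_name, "--ignore-daemonsets",
         "--delete-emptydir-data", "--force"],
        check=True,
    )


if __name__ == "__main__":
    node = os.environ["NODE_NAME"]
    wait_for_preemption()
    drain_node(node)
```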

I have moved to GKE 1.21.1-gke.2200 and hit the same error I used to see on versions < 1.20 without the Node Termination handler. The error happens only occasionally, so it looks like potentially the same race condition.

ImportError: libcuda.so.1: cannot open shared object file: No such file or directory
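As a diagnostic for this failure mode, a GPU workload can verify at startup that the NVIDIA driver library is actually visible before doing any GPU work. This is a minimal sketch, assuming failing fast (and letting Kubernetes restart the pod) is acceptable for the workload; it is not part of the termination handler itself.

```python
# Minimal startup check (sketch): fail fast if the NVIDIA driver library is not
# visible in the container, which is the symptom described above.
import ctypes
import sys


def cuda_driver_available() -> bool:
    try:
        ctypes.CDLL("libcuda.so.1")
        return True
    except OSError:
        return False


if __name__ == "__main__":
    if not cuda_driver_available():
        # A non-zero exit lets Kubernetes restart the pod, ideally after the
        # GPU device plugin has finished setting up the node.
        sys.exit("libcuda.so.1 not found; GPU driver not mounted yet")
```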

I filed the following GKE issue: https://issuetracker.google.com/issues/192809336

For the moment, I would ask that this repo not be deprecated.

chrisroat · Jul 14 '21

At the very least, it seems that the handler is useful through the 1.21 series of releases: https://issuetracker.google.com/issues/204415098

chrisroat · Jan 07 '22

Hi @chrisroat! The GKE issue was closed recently. Are you still facing problems with node shutdowns that would require the node termination handler?

torbendury · Mar 16 '22

I no longer maintain the (closed-source) project that was hitting the issue. We had forked this repo to add the ability to handle spot instances.

@erichamc would be able to test, though I don't think it would be a high priority to check. For reference, the symptom was that the cluster's GPU workloads would not restart properly after node preemptions. Over time, a cluster with enough preemptions to trigger the issue might show failing workloads that cannot find the NVIDIA libraries. [@erichamc: dropping the termination handler would amount to dropping the null_resource stanzas in infrastructure/apps/k8s/kubectl.tf]

chrisroat · Mar 16 '22