k8s-node-termination-handler
Node Termination handler may still be necessary
The current README states this handler is deprecated in favor of the new Graceful Node Shutdown:
⚠️ Deprecation Notice As of Kubernetes 1.20, Graceful Node Shutdown replaces the need for GCP Node termination handler. GKE on versions 1.20+ enables Graceful Node Shutdown by default. Refer to the GKE documentation and Kubernetes documentation for more info about Graceful Node Shutdown (docs, blog post).
I have been using the Node Termination handler with GKE < 1.20, using pre-emptibles with GPUs. The handler was needed to avoid a race condition on node restart that sometimes caused pods not to correctly recognize the GPU.
I have moved to GKE 1.21.1-gke.2200 and see the same error I used to get on versions < 1.20 without the Node Termination handler. The error happens only occasionally, so it looks like it could be the same race condition.
ImportError: libcuda.so.1: cannot open shared object file: No such file or directory
I filed the following GKE issue: https://issuetracker.google.com/issues/192809336
For the moment, I would ask that this repo not be deprecated.
At the very least, it seems that the handler is useful through the 1.21 series of releases: https://issuetracker.google.com/issues/204415098
Hi @chrisroat! The GKE issue was closed recently. Are you still facing any problems with node shutdowns so you still need the node termination handler?
I no longer maintain the (closed-source) project that was hitting the issue. We had forked this repo to add the ability to handle spot instances.
@erichamc would be able to test, though I don't think checking would be high priority. For reference, the symptom was that the cluster's GPU workloads would not restart properly after node preemptions. Over time, a cluster that had enough preemptions to trigger the issue might show failing workloads unable to find the NVIDIA libraries. [@erichamc: dropping the termination handler would amount to dropping the null_resource stanzas in infrastructure/apps/k8s/kubectl.tf]
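For anyone trying to reproduce this, a minimal sketch of a check a workload could run at startup to detect the symptom early, rather than failing later with the ImportError above. It only asks the dynamic loader whether `libcuda.so.1` can be opened; the function name `cuda_driver_available` is hypothetical and not part of this repo or of any NVIDIA tooling.

```python
import ctypes


def cuda_driver_available(library_name: str = "libcuda.so.1") -> bool:
    """Return True if the dynamic loader can open the given shared library.

    Hypothetical helper for illustration only. It checks that the library
    can be loaded, not that a GPU is actually usable.
    """
    try:
        ctypes.CDLL(library_name)
    except OSError:
        # Raised when the shared object cannot be found or loaded, which is
        # the same condition behind the ImportError reported in this issue.
        return False
    return True


if __name__ == "__main__":
    if cuda_driver_available():
        print("libcuda.so.1 loaded")
    else:
        print("libcuda.so.1 could not be loaded")
```

Whether exiting or restarting the pod on a failed check actually works around the race is an assumption I have not verified.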