linode-cloud-controller-manager
CCM does not properly detect nodes that are powered off
Bug Reporting
CCM does not properly detect nodes that are powered off.
Expected Behavior
On shutdown of a Kubernetes node, the CCM detects that it is powered off and migrates workloads off of the node.
Actual Behavior
The node status becomes `NotReady` after a few minutes because kubelet stops responding, but pods on the node still have a status of `Running` and do not get rescheduled.
Steps to Reproduce the Problem
- Shut down a Kubernetes node on a cluster running the CCM and wait 5 minutes
- Observe node status with `kubectl get nodes`, noting that the down node has a status of `NotReady`
- Observe pod status with `kubectl get pods -A`, noting that pods on the down node are not rescheduled (a sketch of these commands follows the list)
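A minimal sketch of the reproduction steps; the node name (`example-node-2`) and the sample output are hypothetical placeholders, not captured from a real cluster:

```sh
# Shut down one node (e.g. via the Linode Cloud Manager or API), then wait ~5 minutes.

# The powered-off node reports NotReady once kubelet stops responding.
kubectl get nodes
# NAME             STATUS     ROLES    AGE   VERSION
# example-node-1   Ready      <none>   10d   v1.24.0
# example-node-2   NotReady   <none>   10d   v1.24.0

# Pods scheduled on the down node still show Running and are not rescheduled.
kubectl get pods -A -o wide --field-selector spec.nodeName=example-node-2
```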
This could be related to the pod-eviction-timeout (https://kubernetes.io/docs/concepts/architecture/nodes/#condition) or to taint-based eviction policies: https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/#taint-based-evictions
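As a hedged illustration of the taint-based eviction path: when the node controller sees the node as unreachable it adds `NoExecute` taints, and pods normally carry default tolerations for those taints with `tolerationSeconds: 300`, which would explain roughly five minutes of inaction. The node and pod names below are placeholders.

```sh
# Inspect the taints the node controller has placed on the down node.
kubectl describe node example-node-2 | grep -A3 Taints
# Taints: node.kubernetes.io/unreachable:NoExecute
#         node.kubernetes.io/unreachable:NoSchedule

# Inspect a stuck pod's default tolerations (added at admission time),
# which delay NoExecute eviction by 300 seconds.
kubectl get pod <pod-name> -o yaml | grep -B1 -A3 tolerationSeconds
```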
I'm adding another piece of context on this topic: non-graceful node shutdowns (a command sketch follows the links).
- https://kubernetes.io/blog/2022/05/20/kubernetes-1-24-non-graceful-node-shutdown-alpha/
- https://kubernetes.io/docs/concepts/architecture/nodes/#non-graceful-node-shutdown
- https://github.com/kubernetes/enhancements/blob/master/keps/sig-storage/2268-non-graceful-shutdown/README.md
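For completeness, the non-graceful shutdown feature linked above currently relies on a user (or an external controller) applying an out-of-service taint to the dead node; a sketch follows, with the node name as a placeholder.

```sh
# Per the non-graceful node shutdown docs, marking the dead node out-of-service
# lets the control plane force-delete its pods and detach their volumes.
kubectl taint nodes example-node-2 node.kubernetes.io/out-of-service=nodeshutdown:NoExecute

# Remove the taint once the node is recovered or deleted from the cluster.
kubectl taint nodes example-node-2 node.kubernetes.io/out-of-service=nodeshutdown:NoExecute-
```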
I don't see any discussion of how a cloud-provider implementation would be expected to signal the shutdown condition; perhaps that will come after the Alpha phase of the feature.