CCM does not properly detect nodes that are powered off

Open jnschaeffer opened this issue 4 years ago • 2 comments

Bug Reporting

CCM does not properly detect nodes that are powered off.

Expected Behavior

On shutdown of a Kubernetes node, the CCM detects that it is powered off and migrates workloads off of the node.

Actual Behavior

The node status becomes NotReady after a few minutes because kubelet stops responding, but pods still have a status of Running and do not get rescheduled.

Steps to Reproduce the Problem

  1. Shut down a Kubernetes node on a cluster running the CCM and wait 5 minutes
  2. Observe node status with kubectl get nodes; note that the powered-off node has a status of NotReady
  3. Observe pod status with kubectl get pods -A; note that pods on that node still show Running and are not rescheduled
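
The check in steps 2 and 3 can be scripted. Below is a minimal sketch, assuming the JSON produced by kubectl get nodes -o json and kubectl get pods -A -o json has been loaded into dictionaries. The function names and the sample data are hypothetical and only illustrate the shape of the check.

```python
def is_ready(node):
    """Return True only if the node has a Ready condition with status "True"."""
    for cond in node.get("status", {}).get("conditions", []):
        if cond.get("type") == "Ready":
            return cond.get("status") == "True"
    return False  # no Ready condition reported at all


def stranded_pods(nodes, pods):
    """List (namespace, name, node) for pods still Running on NotReady nodes."""
    not_ready = {n["metadata"]["name"] for n in nodes["items"] if not is_ready(n)}
    result = []
    for pod in pods["items"]:
        node_name = pod.get("spec", {}).get("nodeName")
        phase = pod.get("status", {}).get("phase")
        if node_name in not_ready and phase == "Running":
            meta = pod["metadata"]
            result.append((meta["namespace"], meta["name"], node_name))
    return result


# Hypothetical sample data shaped like the kubectl JSON output.
sample_nodes = {"items": [
    {"metadata": {"name": "node-a"},
     "status": {"conditions": [{"type": "Ready", "status": "True"}]}},
    {"metadata": {"name": "node-b"},
     "status": {"conditions": [{"type": "Ready", "status": "Unknown"}]}},
]}
sample_pods = {"items": [
    {"metadata": {"namespace": "default", "name": "api-0"},
     "spec": {"nodeName": "node-a"}, "status": {"phase": "Running"}},
    {"metadata": {"namespace": "default", "name": "web-0"},
     "spec": {"nodeName": "node-b"}, "status": {"phase": "Running"}},
    {"metadata": {"namespace": "default", "name": "job-1"},
     "spec": {"nodeName": "node-b"}, "status": {"phase": "Succeeded"}},
]}

print(stranded_pods(sample_nodes, sample_pods))
```

Against a real cluster, the two dictionaries would come from parsing the kubectl output with json.loads instead of the sample data.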

jnschaeffer, Apr 20 '20 16:04

This could be caused by the pod-eviction-timeout setting (https://kubernetes.io/docs/concepts/architecture/nodes/#condition) or by taint-based eviction policies (https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/#taint-based-evictions).
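
To illustrate the taint-based case: a NotReady or unreachable node receives a NoExecute taint, and a pod is evicted once its matching toleration expires. Pods normally receive default tolerations of 300 seconds for node.kubernetes.io/not-ready and node.kubernetes.io/unreachable. The following is a simplified model of that delay, not the controller's actual code. The function names are hypothetical, and matching on toleration values (operator Equal) is deliberately ignored.

```python
NOT_READY = "node.kubernetes.io/not-ready"
UNREACHABLE = "node.kubernetes.io/unreachable"


def matches(toleration, taint_key, effect="NoExecute"):
    """Simplified toleration matching; taint values are not compared."""
    if toleration.get("effect") not in (None, "", effect):
        return False
    key = toleration.get("key")
    if not key:
        # An empty key with operator Exists tolerates every taint.
        return toleration.get("operator") == "Exists"
    return key == taint_key


def eviction_delay(tolerations, taint_key):
    """Seconds until eviction: 0 if not tolerated, None if tolerated forever."""
    matching = [t for t in tolerations if matches(t, taint_key)]
    if not matching:
        return 0
    bounded = [t["tolerationSeconds"] for t in matching
               if t.get("tolerationSeconds") is not None]
    if not bounded:
        return None  # no time limit on any matching toleration
    return max(0, min(bounded))


# Tolerations as they typically appear on a pod that sets none of its own.
default_tolerations = [
    {"key": NOT_READY, "operator": "Exists",
     "effect": "NoExecute", "tolerationSeconds": 300},
    {"key": UNREACHABLE, "operator": "Exists",
     "effect": "NoExecute", "tolerationSeconds": 300},
]

print(eviction_delay(default_tolerations, UNREACHABLE))
```

Under this model, a pod with the default tolerations stays bound to the node for five minutes after the taint is applied, which is in line with the 5-minute wait in the reproduction steps.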

displague, Jun 23 '21 12:06

I'm adding another context hint on this topic: non-graceful node shutdowns.

  • https://kubernetes.io/blog/2022/05/20/kubernetes-1-24-non-graceful-node-shutdown-alpha/
  • https://kubernetes.io/docs/concepts/architecture/nodes/#non-graceful-node-shutdown
  • https://github.com/kubernetes/enhancements/blob/master/keps/sig-storage/2268-non-graceful-shutdown/README.md

I don't see any conversations about how a cloud-provider implementation would be expected to signal the shutdown conditions; perhaps that will come after the Alpha phase of the feature.
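
For context, the linked Kubernetes documentation describes a manual flow: once an operator has confirmed that the node is really shut down, they add the node.kubernetes.io/out-of-service taint to it. A minimal sketch of building that command follows. The helper name is hypothetical, and nothing here implies that the CCM does this today.

```python
def out_of_service_taint_cmd(node_name, value="nodeshutdown", effect="NoExecute"):
    """Build the kubectl argv that marks a confirmed-down node as out of service."""
    if effect not in ("NoExecute", "NoSchedule"):
        raise ValueError(f"unsupported taint effect: {effect}")
    taint = f"node.kubernetes.io/out-of-service={value}:{effect}"
    return ["kubectl", "taint", "nodes", node_name, taint]


# "node-b" is a placeholder node name.
print(" ".join(out_of_service_taint_cmd("node-b")))
```

The command is only built here, not run; it could be passed to subprocess.run after verifying that the node is powered off.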

displague, Sep 08 '22 17:09