Node reboot is not detected when a Kubernetes VM worker node reboots quickly
We're running the KMM Operator on a VM-based Kubernetes cluster and have hit an issue with how node reboots are handled.
KMM currently handles the node reboot scenario as follows: when a node reboots, its status condition changes Ready -> NotReady -> Ready. The NMC reconciler then unloads the kmod and loads it again, based on the Node status and the last transition timestamp of its Ready condition. https://github.com/kubernetes-sigs/kernel-module-management/blob/9c4a309da7dc214a59c7cbf48bae88758227aade/internal/node/node.go#L88
However, we found that a VM worker node can reboot very quickly. As a result, the k8s cluster may never mark the node as NotReady during the fast reboot, and after the reboot KMM does not try to load the kernel module on the rebooted VM node.
Digging further into how the cluster decides whether a Node is Ready or NotReady: it is generally based on the communication between the kubelet and the controller manager. The controller manager waits for a grace period before deciding that a node is not ready. Normally the reboot recovery time is longer than the grace period, so the node is marked NotReady during the reboot. For a fast-rebooting VM, however, the node is back before the grace period expires, so the cluster never marks it NotReady during or after the reboot.
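To make the timing argument concrete, here is a minimal sketch in Go. It is a deliberately simplified model and not the controller manager's actual logic: the function `markedNotReady` and the constant `defaultGracePeriod` are hypothetical names for illustration, and the only input taken from this issue is the 50s default quoted below for --node-monitor-grace-period.

```go
package main

import "time"

// defaultGracePeriod mirrors the 50s default of --node-monitor-grace-period
// quoted in this issue. Real clusters may configure a different value.
const defaultGracePeriod = 50 * time.Second

// markedNotReady is a simplified, hypothetical model: the node only loses
// its Ready status if the gap between two kubelet heartbeats is longer
// than the grace period.
func markedNotReady(heartbeatGap, gracePeriod time.Duration) bool {
	return heartbeatGap > gracePeriod
}
```

Under this model, a VM that is back and reporting again after, say, 20s stays Ready throughout with a 50s grace period, so a check that relies only on the Ready condition never observes the reboot.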
One workaround is to fine-tune the grace period of the cluster's controller manager (e.g. for vanilla k8s: --node-monitor-grace-period duration, default: 50s; OpenShift clusters have similar parameters to fine-tune https://docs.redhat.com/en/documentation/openshift_container_platform/4.10/html/nodes/working-with-nodes#nodes-nodes-viewing-[…]_nodes-nodes-viewing). However, different customers may have different cluster-level configurations, so tuning these parameters is not a good general approach.
We have now hit a use case where detecting a node reboot based solely on the node's Ready condition is not enough. We propose to additionally detect whether a reboot happened by checking the bootID field in the Node status. The node's bootID changes after each reboot, which makes it a more reliable way to determine whether a node has been rebooted.
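A rough sketch of the proposed check, in Go. This is an illustration under assumptions, not a patch against KMM: `nodeSnapshot` and `rebootDetected` are hypothetical names, and where the last seen bootID would be recorded (for example somewhere in the NMC status) is left open. In k8s.io/api/core/v1 the value itself is available as node.Status.NodeInfo.BootID.

```go
package main

// nodeSnapshot is a simplified stand-in for the Node fields relevant here.
// In the real API the boot ID lives at node.Status.NodeInfo.BootID.
type nodeSnapshot struct {
	Ready  bool
	BootID string
}

// rebootDetected reports whether the node has rebooted since the boot ID
// was last recorded (hypothetically, when the kmod was last loaded).
// It does not depend on the Ready condition, so it also catches reboots
// that complete within the controller manager's grace period.
func rebootDetected(lastSeenBootID string, n nodeSnapshot) bool {
	if lastSeenBootID == "" || n.BootID == "" {
		// Nothing to compare against yet; leave the decision to the
		// existing Ready-condition logic.
		return false
	}
	return n.BootID != lastSeenBootID
}
```

With a check like this, a node that stayed Ready but now reports a different bootID than the recorded one would be treated as rebooted, and the reconciler could unload and reload the kmod as it does today after a Ready -> NotReady -> Ready transition.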
Please take a look and let us know whether this approach sounds reasonable.