machine-controller-manager
☂️ Improve health checks based on node conditions
Context
Currently, node conditions are added to a node by different actors such as the kubelet, Node Problem Detector (NPD), and Network Problem Detector. However, MCM acts on only a few of them by default, and not very effectively.
Customers can update the conditions at the shoot level from this section. The current shoot defaults are:
- ReadonlyFilesystem (NPD)
- KernelDeadlock (NPD)
- DiskPressure (kubelet)
MCM has its own defaults, but they are overridden by the shoot defaults (or by values provided by the customer). The MCM defaults are:
- KernelDeadlock (NPD)
- ReadonlyFilesystem (NPD)
- DiskPressure (kubelet)
- NetworkUnavailable (kubelet)
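To make the role of these lists concrete, here is a minimal Go sketch of matching a node's conditions against a configured list. It is an illustration only, not MCM's actual code: `NodeCondition` is a trimmed-down stand-in for the Kubernetes type, `parseWatched` and `unhealthyConditions` are hypothetical helpers, and passing the list as a comma-separated string is an assumption. The special handling of the `Ready` condition is left out.

```go
package main

import (
	"fmt"
	"strings"
)

// NodeCondition is a simplified stand-in for the Kubernetes node condition type.
type NodeCondition struct {
	Type   string
	Status string // "True", "False" or "Unknown"
}

// parseWatched turns a comma-separated list of condition types into a set.
// (Hypothetical helper; the list format is an assumption.)
func parseWatched(list string) map[string]bool {
	watched := make(map[string]bool)
	for _, name := range strings.Split(list, ",") {
		if name = strings.TrimSpace(name); name != "" {
			watched[name] = true
		}
	}
	return watched
}

// unhealthyConditions returns the types of all watched conditions whose
// status is "True". Conditions outside the watched set are ignored.
func unhealthyConditions(conds []NodeCondition, watched map[string]bool) []string {
	var bad []string
	for _, c := range conds {
		if watched[c.Type] && c.Status == "True" {
			bad = append(bad, c.Type)
		}
	}
	return bad
}

func main() {
	// The MCM defaults listed above.
	watched := parseWatched("KernelDeadlock,ReadonlyFilesystem,DiskPressure,NetworkUnavailable")
	conds := []NodeCondition{
		{Type: "Ready", Status: "True"},
		{Type: "DiskPressure", Status: "True"},
		{Type: "HostNetworkProblem", Status: "True"}, // not in the list, so ignored
	}
	fmt.Println(unhealthyConditions(conds, watched)) // [DiskPressure]
}
```

Note that `HostNetworkProblem` is skipped in this sketch even though it is `True`, because it is not part of the configured list.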
Examples of conditions added by Network Problem Detector:
- lastHeartbeatTime: "2023-02-22T07:27:28Z"
  lastTransitionTime: "2023-02-20T14:18:15Z"
  message: no cluster network problems
  reason: NoNetworkProblems
  status: "False"
  type: ClusterNetworkProblem
- lastHeartbeatTime: "2023-02-22T07:27:23Z"
  lastTransitionTime: "2023-02-21T13:59:48Z"
  message: no host network problems
  reason: NoNetworkProblems
  type: HostNetworkProblem
  status: "False"
Goal
MCM should use node conditions more effectively and replace a node when its conditions indicate that it is unhealthy.
Quick Improvements
- [ ] Identify unrecoverable node conditions (e.g. ReadonlyFilesystem) and don't wait for healthTimeout; mark the machine Failed immediately.
  - [ ] ReadonlyFilesystem could be deceiving, as it is also reported when some other file system is read-only: https://github.com/kubernetes/node-problem-detector/issues/474
- [ ] Taint-based evictions don't follow PDBs (to be confirmed; if so, use a health timeout of 0): https://github.com/kubernetes/website/issues/7829
- [ ] MCM should also consider node conditions added by Network Problem Detector.
Research
- [ ] Collect metrics on the node conditions that are currently acted upon.
- [ ] Relevance of the KernelDeadlock condition: is it just for Docker? Gardener doesn't support Docker. Is it a permanent condition?
- [ ] Look into the effectiveness of tainting nodes by condition. Refer to a similar issue on NPD (https://github.com/kubernetes/node-problem-detector/issues/457). Currently the tainting is done by KCM only, and only for node conditions added by the kubelet.
- [ ] Feasibility of triggering NPD's remediation process.
- [ ] Edit the NPD config to introduce new node conditions that could be beneficial for Gardener scenarios.
- [ ] Feasibility of different timeouts for different conditions, or would that be too much fine-tuning? See live issue # 2653
Why is this needed:
To have better, more reliable recovery of nodes and less downtime for customer workloads.