machine-controller-manager

☂️ Improve health checks based on node conditions

Open prashanth26 opened this issue 3 years ago • 0 comments

Context

Currently, node conditions are added to a node by different actors such as the kubelet, Node Problem Detector (NPD), and Network Problem Detector. However, MCM acts on only a few of them by default, and not very effectively.

Customers can update the conditions at the shoot level from this section. The current shoot defaults are:

- ReadonlyFilesystem (NPD)
- KernelDeadlock (NPD)
- DiskPressure (kubelet)

MCM has its own defaults, but they are overridden by the shoot defaults (or by values provided by the customer). The MCM defaults are:

- KernelDeadlock (NPD)
- ReadonlyFilesystem (NPD)
- DiskPressure (kubelet)
- NetworkUnavailable (kubelet)

Examples of conditions added by the Network Problem Detector:

  - lastHeartbeatTime: "2023-02-22T07:27:28Z"
    lastTransitionTime: "2023-02-20T14:18:15Z"
    message: no cluster network problems
    reason: NoNetworkProblems
    status: "False"
    type: ClusterNetworkProblem
  - lastHeartbeatTime: "2023-02-22T07:27:23Z"
    lastTransitionTime: "2023-02-21T13:59:48Z"
    message: no host network problems
    reason: NoNetworkProblems
    type: HostNetworkProblem
    status: "False"
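
As a rough illustration of how conditions like the ones above could be matched against a configured list, here is a minimal Go sketch. It is not MCM's actual implementation: `NodeCondition` is a reduced, hypothetical stand-in for the Kubernetes node condition type, and `unhealthyConditions` and the comma-separated `watched` parameter are assumptions made for this example.

```go
package main

import "strings"

// NodeCondition is a simplified, hypothetical stand-in for a Kubernetes
// node condition; only the fields needed for this sketch are kept.
type NodeCondition struct {
	Type   string
	Status string // "True", "False" or "Unknown"
}

// unhealthyConditions returns the types of all conditions that are listed in
// watched (a comma-separated list, assumed format) and currently have
// status "True". The result preserves the order of conds.
func unhealthyConditions(conds []NodeCondition, watched string) []string {
	watch := map[string]bool{}
	for _, name := range strings.Split(watched, ",") {
		if name = strings.TrimSpace(name); name != "" {
			watch[name] = true
		}
	}
	var hits []string
	for _, c := range conds {
		if watch[c.Type] && c.Status == "True" {
			hits = append(hits, c.Type)
		}
	}
	return hits
}
```

With a watch list such as `"KernelDeadlock,HostNetworkProblem,ClusterNetworkProblem"`, the two conditions shown above would not be reported, because both have status "False".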

Goal

MCM should use node conditions more effectively and replace a node when its conditions indicate that it is unhealthy.

Quick Improvements

  • [ ] Identify unrecoverable node conditions (e.g. ReadonlyFilesystem) and mark the machine Failed immediately instead of waiting for healthTimeout.
    • [ ] This could be misleading, since the condition is also reported when a different file system is read-only: https://github.com/kubernetes/node-problem-detector/issues/474
    • [ ] Taint-based evictions don't follow PDBs (to be confirmed; if so, use a health timeout of 0): https://github.com/kubernetes/website/issues/7829
  • [ ] MCM should also consider the node conditions added by the Network Problem Detector.

Research

  • [ ] Collect metrics on the node conditions that are currently acted upon.
  • [ ] Check the relevance of the KernelDeadlock condition. Is it only for Docker? Gardener doesn't support Docker. Is it a permanent condition?
  • [ ] Look into the effectiveness of tainting nodes by condition. Refer to a similar issue on NPD: https://github.com/kubernetes/node-problem-detector/issues/457. Currently the tainting is done only by KCM, and only for node conditions added by the kubelet.
  • [ ] Check the feasibility of triggering the NPD remediation process.
  • [ ] Edit the NPD config to introduce new node conditions that could be beneficial for Gardener scenarios.
  • [ ] Check the feasibility of different timeouts for different conditions, or would that be too much fine-tuning? See live issue #2653.

Why is this needed:

To achieve better, more reliable recovery of nodes and less downtime for customer workloads.

prashanth26, May 06 '21 03:05