Support overriding/setting `disableHealthTimeout` via annotation on node
How to categorize this issue?
/area usability /area control-plane /kind enhancement /priority 3
What would you like to be added:
An annotation similar to the already existing node.machine.sapcloud.io/trigger-deletion-by-mcm that, if set on a node, disables the machineHealthTimeout for the specific machine until the annotation is removed/set to false.
Why is this needed: We have a use case where some VMs are powered off for maintenance that is unrelated to any Gardener-triggered maintenance (OS update, k8s update, ...).
An external, Gardener-independent component (e.g., a daemonset) can check for such maintenance and set the annotation accordingly. After the maintenance is over and the VM is up and running again, the component would remove the annotation.
Using the in-place upgrade strategy is not suitable for us, as we still want to update nodes by rolling replace. Additionally, the in-place upgrade strategy would cause all nodes of the workerpool to have disableHealthTimeout always set to true (not just while an upgrade is in progress), but we still want to automatically replace machines that unexpectedly went offline.
Additional context/small discussion can be found here: https://gardener-cloud.slack.com/archives/C045DSWJZB9/p1747128961067609
CC: @acumino @unmarshall
If you agree with the proposed solution, we will gladly implement it 😄
@Kumm-Kai We have some questions
- How would MCM proceed with rolling updates for this machine deployment? Would these annotated machines be ignored?
- When the annotation is eventually removed from the node, what is the action expected from MCM at that point?
I'm not sure how the disableHealthTimeout field currently influences rollouts. But, without having looked into it that deeply, I would envision the annotation to behave exactly the same as when the machine object has disableHealthTimeout true.
So to answer your specific questions:
- MCM would behave like usual, doing machine rolling updates as configured in the `machineDeployment` (`maxSurge`, `maxUnavailable`, ...)
- Once removed, MCM should "restart" respecting the `machineHealthTimeout` from the time when the annotation got removed. Here we probably need to update `machine.Status.CurrentStatus.LastUpdateTime.Time` to `time.Now()` as it is used in this check: https://github.com/gardener/machine-controller-manager/blob/e18ab264fce9893f779e30e15821162a95b1e581/pkg/util/provider/machinecontroller/machine_util.go#L1010-L1011
- MCM would behave like usual, doing machine rolling updates as configured in the machineDeployment (maxSurge, maxUnavailable, ...)
Well, this does not answer the question. During a rolling update, machines are terminated and new machines are created. Since the annotation is set on the node, one of two things can happen:
- We allow the annotated node to be rolled. In this case the new replacement node will not have this annotation, and the user will have to re-add it (or MCM needs some logic to identify the newly created node that replaced the one in question and annotate it).
- We do not allow annotated nodes to be rolled. In this case we can end up in a state where some machines in the cluster have an outdated component (the OS version, for example). Is it then the user's responsibility to manually upgrade the node? This also ties in with the second question: when the annotation is removed, will MCM have to check whether the node needs to be upgraded and perform the upgrade?
@aaronfern in our use case, the annotation would be set on a specific node. If the node is terminated because of a rolling update, we don't need to add the annotation to the replacement node. So annotated nodes can be rolled just as usual, and MCM would not need any special logic.
Ok, please proceed with the PR. You can refactor the disableHealthTimeout into a separate function that checks both the machine.Spec.MachineConfiguration.DisableHealthTimeout and the node annotation you propose, so that it is easy to unit-test. You can introduce the node annotation constant in pkg/machine/v1alpha1/constants.go
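A rough sketch of what such a helper could look like, so that it can be unit-tested in isolation. The annotation key, the `machineConfiguration` stand-in type and the function name are assumptions for illustration only; the real key would be defined in pkg/machine/v1alpha1/constants.go, and the real field type may differ. The sketch also assumes that only the value "true" disables the timeout.

```go
package main

import "fmt"

// Hypothetical annotation key; the actual name is up to the PR.
const disableHealthTimeoutAnnotation = "node.machine.sapcloud.io/disable-health-timeout"

// machineConfiguration is a simplified stand-in for
// machine.Spec.MachineConfiguration.
type machineConfiguration struct {
	DisableHealthTimeout bool
}

// isHealthTimeoutDisabled returns true if the timeout is disabled either in
// the machine configuration or via the node annotation set to "true".
func isHealthTimeoutDisabled(cfg *machineConfiguration, nodeAnnotations map[string]string) bool {
	if cfg != nil && cfg.DisableHealthTimeout {
		return true
	}
	// Reading from a nil map is safe in Go and yields the empty string.
	return nodeAnnotations[disableHealthTimeoutAnnotation] == "true"
}

func main() {
	annotated := map[string]string{disableHealthTimeoutAnnotation: "true"}
	fmt.Println(isHealthTimeoutDisabled(nil, annotated))
	fmt.Println(isHealthTimeoutDisabled(&machineConfiguration{}, nil))
}
```

Keeping both checks in one function means the health-timeout code path only needs a single call, and removing the annotation or setting it to "false" falls back to the machine configuration.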