machine-controller-manager

Support overriding/setting `disableHealthTimeout` via annotation on node

Open Kumm-Kai opened this issue 6 months ago • 5 comments

How to categorize this issue?

/area usability /area control-plane /kind enhancement /priority 3

What would you like to be added: An annotation similar to the already existing node.machine.sapcloud.io/trigger-deletion-by-mcm that, if set on a node, disables the machineHealthTimeout for the specific machine until the annotation is removed/set to false.

Why is this needed: We have a use case where some VMs are powered off for maintenance that is unrelated to any Gardener-related maintenance (OS update, k8s update, ...).

An external, Gardener-independent component (e.g., a DaemonSet) can check for such maintenance and set the annotation accordingly. After the maintenance is over and the VM is up and running again, the component would remove the annotation.

Using the in-place upgrade strategy is not suitable for us, as we still want to update nodes by rolling replace. Additionally, the in-place upgrade strategy would cause all nodes of the workerpool to have disableHealthTimeout always set to true (not just while an upgrade is in progress), but we still want to automatically replace machines that unexpectedly went offline.

Additional context/small discussion can be found here: https://gardener-cloud.slack.com/archives/C045DSWJZB9/p1747128961067609

CC: @acumino @unmarshall

If you agree with the proposed solution, we will gladly implement it 😄

Kumm-Kai May 23 '25 06:05

@Kumm-Kai We have some questions

  1. How would MCM proceed with rolling updates for this machine deployment? Would these annotated machines be ignored?
  2. When the annotation is eventually removed from the node, what is the action expected from MCM at that point?

aaronfern Jun 13 '25 06:06

> @Kumm-Kai We have some questions
>
> 1. How would MCM proceed with rolling updates for this machine deployment? Would these annotated machines be ignored?
>
> 2. When the annotation is eventually removed from the node, what is the action expected from MCM at that point?

I'm not sure how the `disableHealthTimeout` field currently influences rollouts. But, without having looked into it deeply, I would expect the annotation to behave exactly the same as when the machine object has `disableHealthTimeout` set to true.

So to answer your specific questions:

  1. MCM would behave as usual, doing machine rolling updates as configured in the machineDeployment (maxSurge, maxUnavailable, ...)
  2. Once removed, MCM should "restart" respecting the machineHealthTimeout from the time when the annotation got removed. Here we probably need to update machine.Status.CurrentStatus.LastUpdateTime.Time to time.Now() as it is used in this check: https://github.com/gardener/machine-controller-manager/blob/e18ab264fce9893f779e30e15821162a95b1e581/pkg/util/provider/machinecontroller/machine_util.go#L1010-L1011

Kumm-Kai Jun 13 '25 12:06

> 1. MCM would behave as usual, doing machine rolling updates as configured in the machineDeployment (maxSurge, maxUnavailable, ...)

Well, this does not answer the question. During a rolling update, machines are terminated and new machines are created. Since the annotation is set on the node, one of two things can happen:

  1. We allow the annotated node to be rolled. In this case, the replacement node will not have this annotation, and the user will have to re-add it (or MCM needs some logic to identify the newly created node that replaced the one in question and annotate it).
  2. We do not allow annotated nodes to be rolled. In this case, we can end up in a state where some machines in the cluster have an outdated component (the OS version, for example). Is it then the user's responsibility to manually upgrade the node? This also ties in with the second question: when the annotation is removed, will MCM have to check whether the node needs to be upgraded and perform the upgrade?

aaronfern Jul 01 '25 10:07

@aaronfern In our use case, the annotation would be set on a specific node. If the node is terminated because of a rolling update, we don't need to add the annotation to the replacement node. So annotated nodes can be rolled just as usual, and MCM would not need any special logic.

Kumm-Kai Jul 18 '25 09:07

Ok, please proceed with the PR. You can refactor the `disableHealthTimeout` check into a separate function that checks both `machine.Spec.MachineConfiguration.DisableHealthTimeout` and the node annotation you propose, so that it is easy to unit-test. You can introduce the node annotation constant in `pkg/machine/v1alpha1/constants.go`.
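A minimal sketch of such a helper, to illustrate the shape only. The annotation key `node.machine.sapcloud.io/disable-health-timeout` is a placeholder (the final name is not decided in this thread), `machineConfiguration` is a trimmed stand-in for the real type, and treating `DisableHealthTimeout` as a `*bool` is an assumption. The function name `isHealthTimeoutDisabled` is hypothetical as well.

```go
package main

import (
	"fmt"
	"strconv"
)

// Placeholder annotation key for illustration; the real constant name
// and value would be chosen in the PR.
const annotationDisableHealthTimeout = "node.machine.sapcloud.io/disable-health-timeout"

// machineConfiguration is a trimmed stand-in for the machine spec
// configuration; the pointer type is an assumption.
type machineConfiguration struct {
	DisableHealthTimeout *bool
}

// isHealthTimeoutDisabled (hypothetical name) reports whether the health
// timeout should be skipped, either because the machine spec says so or
// because the node carries the annotation with a true-like value.
// A missing, "false", or unparsable annotation value does not disable it.
func isHealthTimeoutDisabled(cfg *machineConfiguration, nodeAnnotations map[string]string) bool {
	if cfg != nil && cfg.DisableHealthTimeout != nil && *cfg.DisableHealthTimeout {
		return true
	}
	value, ok := nodeAnnotations[annotationDisableHealthTimeout]
	if !ok {
		return false
	}
	disabled, err := strconv.ParseBool(value)
	return err == nil && disabled
}

func main() {
	annotated := map[string]string{annotationDisableHealthTimeout: "true"}
	fmt.Println(isHealthTimeoutDisabled(nil, annotated)) // true
	fmt.Println(isHealthTimeoutDisabled(nil, nil))       // false
}
```

Keeping the decision in one pure function means the unit tests only need a config value and an annotation map, with no fake clients or controllers.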

elankath Sep 09 '25 10:09