cluster-api
MachineSetPreflightChecks + maxSurge zero results in MDs downscaled for a time longer than desirable
What would you like to be added (User Story)?
As a user, when I use maxSurge zero, I would like the time during which my MachineDeployments are downscaled to be as short as possible during a cluster Kubernetes upgrade -- today they are downscaled at once when the MD is updated, and this remains true for the whole duration of the control plane nodes' rolling update.
The desirable evolution would be a behavior where, when maxSurge zero is used, the old MachineSet for a MachineDeployment is not downscaled until the new MachineSet can actually be upscaled (based on MachineSetPreflightChecks).
Said otherwise:
- with the current 1.9.x code, an MD is below its target number of replicas for a duration of: time to update the CP nodes + time to update the MD nodes
- ideally, an MD would be below its target number of replicas only for the time to update the MD nodes
Detailed Description
Example Scenario
Let's consider the following example scenario:
- MachineSetPreflightChecks are enabled (they have been enabled by default since 1.9.x)
- one or more MachineDeployments are used
- they use the RollingUpdate strategy with maxSurge set to zero
- CAPI resources for a cluster are updated to trigger a Kubernetes version upgrade (CP and MDs)
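For illustration, a MachineDeployment of roughly the following shape reproduces the scenario (names and versions are placeholders; bootstrap and infrastructure references are omitted for brevity):

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  name: example-md            # placeholder name
spec:
  clusterName: example-cluster
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 0             # no spare hardware: old machine is deleted before the new one is created
      maxUnavailable: 1
  template:
    spec:
      version: v1.31.0        # bumped at the same time as the control plane version
      # bootstrap / infrastructureRef omitted for brevity
```

With maxSurge 0 and maxUnavailable 1, every rollout necessarily starts by deleting a machine from the old MachineSet.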
Current behavior
What happens today is the following:
- a CP node rolling update is triggered and starts
- immediately, for all MDs, a new MachineSet is created and the previous one is downscaled (maxSurge 0)
- at this point all MDs are downscaled, and this persists until the CP node rolling update finishes (which on bare metal isn't quick: ~1h is a typical order of magnitude -- 3 nodes, each rebuilt in 20 minutes)
Problem statement
With maxSurge zero, during a Kubernetes upgrade, all MDs are downscaled by one for longer than desirable (for the whole time needed to rolling-update all the CP nodes), while ideally the MDs could remain untouched during the CP nodes' rolling update.
The desirable evolution would be a behavior where the old MachineSet is not downscaled until the new MachineSet can actually be upscaled.
Relevance
The scenario above is commonplace for CAPI bare-metal deployments (e.g. with capm3), where it is common not to have spare hardware. In low-footprint bare-metal scenarios where clusters have few nodes, the difference in available processing resources can be significant.
Note
I opted for filing this as a "feature request", but my feeling is that some might qualify the current behavior as a regression. Please feel free to requalify it as a "bug report" if you think that is warranted.
Anything else you would like to add?
No response
Label(s) to be applied
/kind feature /area machinedeployment /area upgrades
Question: Are you using clusterclass or how are you orchestrating it?
Would be good to have a more fine-grained example (with yaml's?).
Question: Are you using clusterclass or how are you orchestrating it?
No clusterclass.
And no orchestration either, I would say, although I'm not 100% sure what you mean by "orchestration" in the context here.
We're using a single Helm chart (here) that produces both the CAPI control plane resource and the MDs (it updates them all at the same time).
Would be good to have a more fine-grained example (with yaml's?).
Can you tell me which part of the manifest is relevant? Our manifests are fairly standard; we set the k8s version under KubeadmControlPlane.spec.version and MachineDeployment.spec.template.spec.version.
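For reference, the two version fields mentioned above look like this (excerpt only, version value is a placeholder; all other fields omitted):

```yaml
apiVersion: controlplane.cluster.x-k8s.io/v1beta1
kind: KubeadmControlPlane
spec:
  version: v1.31.0          # CP target version, bumped in the same Helm release...
---
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
spec:
  template:
    spec:
      version: v1.31.0      # ...as the MD target version
```

Bumping both in one apply is what triggers the CP rolling update and the MD rollouts simultaneously.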
Finally found the time to read through the issue. What you are describing makes sense to me.
With ClusterClass this doesn't happen as we only trigger MD upgrades after CP is upgraded.
But I see how this can happen if CP & MD objects are updated at the same time.
Note: it is possible to disable MachineSetPreflightChecks, but if you are using kubeadm it may expose you to other issues.
Before implementing a solution please share a possible solution here for further discussion.
/help /triage accepted /priority backlog
@chrischdi: This request has been marked as needing help from a contributor.
Guidelines
Please ensure that the issue body includes answers to the following questions:
- Why are we solving this issue?
- To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
- How can the assignee reach out to you for help?
For more details on the requirements of such an issue, please see here and ensure that they are met.
If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.
In response to this:
Note: it is possible to disable MachineSetPreflightChecks, but if you are using kubeadm it may expose you to other issues.
Before implementing a solution please share a possible solution here for further discussion.
/help /triage accepted /priority backlog
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Note: it is possible to disable MachineSetPreflightChecks, but if you are using kubeadm it may expose you to other issues.
Yes. Of course, I should have mentioned that there are contexts where we need to keep the benefit of MachineSetPreflightChecks and still solve the challenge described in this issue.
Before implementing a solution please share a possible solution here for further discussion.
The solution I have in mind (bird's-eye view) would be to have the MachineDeployment controller downscale an old, not-up-to-date MachineSet only if the current MachineSet that would be scaled up to compensate is passing MachineSetPreflightChecks.
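A minimal sketch of that gating decision, assuming maxSurge 0 (all types and function names here are illustrative and do not match the real cluster-api controller code; `preflightChecksPass` stands in for the actual MachineSetPreflightChecks evaluation, which in reality consults control plane state such as version skew):

```go
package main

import "fmt"

// MachineSet is a simplified stand-in for the CAPI MachineSet object.
type MachineSet struct {
	Name           string
	Replicas       int
	PreflightsPass bool // would be computed against the control plane in the real controller
}

// preflightChecksPass is a placeholder for the real MachineSetPreflightChecks
// evaluation (e.g. KubeadmVersionSkew, ControlPlaneIsStable).
func preflightChecksPass(ms *MachineSet) bool {
	return ms.PreflightsPass
}

// mayScaleDownOldMachineSet encodes the proposed rule: with maxSurge=0,
// only downscale the old MachineSet if the new MachineSet that would be
// scaled up to compensate is currently passing its preflight checks.
func mayScaleDownOldMachineSet(newMS *MachineSet) bool {
	return preflightChecksPass(newMS)
}

func main() {
	newMS := &MachineSet{Name: "md-1-new", PreflightsPass: false}
	fmt.Println(mayScaleDownOldMachineSet(newMS)) // false: CP rollout in flight, hold the old MachineSet
	newMS.PreflightsPass = true
	fmt.Println(mayScaleDownOldMachineSet(newMS)) // true: CP rollout done, proceed with the MD rollout
}
```

The effect is that the MD stays at full replica count until the moment the replacement machine could really be created, shrinking the downscaled window to just the MD rollout itself.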
FYI, we are going to change the machinedeployment controller to account for in-place.
I'm experimenting in https://github.com/fabriziopandini/cluster-api/tree/chained-upgrades. When this work merges, it will also provide unit tests that will allow your proposed changes to be considered with confidence.