cluster-api
MachineSetPreflightChecks + maxSurge zero results in MDs downscaled for a time longer than desirable
What would you like to be added (User Story)?
As a user, when I use maxSurge zero, I would like the time during which my MachineDeployments are downscaled to be as short as possible during a cluster Kubernetes upgrade -- today they are downscaled at once when the MD is updated, and this remains true for the whole duration of the control plane nodes' rolling update.
The desirable evolution would be a behavior where, when maxSurge zero is used, the old MachineSet for a MachineDeployment is not downscaled until the new MachineSet can actually be upscaled (based on MachineSetPreflightChecks).
Said otherwise:
- with the current 1.9.x code, an MD is below its target number of replicas for a duration of: time to update the CP nodes + time to update the MD nodes
- ideally, an MD would be below its target number of replicas only for the time to update the MD nodes
Detailed Description
Example Scenario
Let's consider the following example scenario:
- MachineSetPreflightChecks are enabled (they have been enabled by default since 1.9.x)
- one or more MachineDeployments are used
- they use the RollingUpdate strategy with maxSurge set to zero
- CAPI resources for a cluster are updated to trigger a Kubernetes version upgrade (CP and MDs)
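For illustration, a MachineDeployment of roughly the following shape reproduces the scenario (names and versions are placeholders; bootstrap and infrastructure references are omitted for brevity):

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  name: example-md            # placeholder name
spec:
  clusterName: example-cluster
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 0             # no spare hardware: old machine is deleted before the new one is created
      maxUnavailable: 1
  template:
    spec:
      version: v1.31.0        # bumped at the same time as the control plane version
      # bootstrap / infrastructureRef omitted for brevity
```

With maxSurge 0 and maxUnavailable 1, every rollout necessarily starts by deleting a machine from the old MachineSet.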
Current behavior
What happens today is the following:
- a CP node rolling update is triggered and starts
- immediately, for all MDs, a new MachineSet is created and the previous one is downscaled (maxSurge 0)
- at this point all MDs are downscaled, and this persists until the CP node rolling update finishes (which on bare metal isn't quick: ~1h is a typical order of magnitude -- 3 nodes, each rebuilt in 20 minutes)
Problem statement
With maxSurge zero, during a Kubernetes upgrade, all MDs are downscaled by one for longer than desirable (for the whole time needed to rolling-update all the CP nodes), while ideally the MDs could remain untouched during the CP nodes' rolling update.
The desirable evolution would be a behavior where the old MachineSet is not downscaled until the new MachineSet can actually be upscaled.
Relevance
The scenario above is commonplace for CAPI bare-metal deployments (e.g. with capm3), where it is common not to have spare hardware. In low-footprint bare-metal scenarios where clusters have few nodes, the difference in available processing resources can be significant.
Note
I opted for filing this as a "feature request", but my feeling is that some might qualify the current behavior as a regression. Please feel free to requalify it as a "bug report" if you think that is warranted.
Anything else you would like to add?
No response
Label(s) to be applied
/kind feature /area machinedeployment /area upgrades
Question: Are you using clusterclass or how are you orchestrating it?
Would be good to have a more fine-grained example (with yaml's?).
Question: Are you using clusterclass or how are you orchestrating it?
No clusterclass.
And no orchestration either, I would say, although I'm not 100% sure what you mean by "orchestration" in the context here.
We're using a single Helm chart (here) that produces both the CAPI control plane resource and the MDs (it updates them all at the same time).
Would be good to have a more fine-grained example (with yaml's?).
Can you tell me which part of the manifest is relevant? Our manifests are fairly standard; we set the k8s version under KubeadmControlPlane.spec.version and MachineDeployment.spec.template.spec.version.
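For reference, the two version fields mentioned above look like this (excerpt only, version value is a placeholder; all other fields omitted):

```yaml
apiVersion: controlplane.cluster.x-k8s.io/v1beta1
kind: KubeadmControlPlane
spec:
  version: v1.31.0          # CP target version, bumped in the same Helm release...
---
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
spec:
  template:
    spec:
      version: v1.31.0      # ...as the MD target version
```

Bumping both in one apply is what triggers the CP rolling update and the MD rollouts simultaneously.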
Finally found the time to read through the issue. What you are describing makes sense to me.
With ClusterClass this doesn't happen as we only trigger MD upgrades after CP is upgraded.
But I see how this can happen if CP & MD objects are updated at the same time.
Note: it is possible to disable MachineSetPreflightChecks, but if you are using kubeadm it may expose you to other issues.
Before implementing a solution please share a possible solution here for further discussion.
/help /triage accepted /priority backlog
@chrischdi: This request has been marked as needing help from a contributor.
Guidelines
Please ensure that the issue body includes answers to the following questions:
- Why are we solving this issue?
- To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
- How can the assignee reach out to you for help?
For more details on the requirements of such an issue, please see here and ensure that they are met.
If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.
In response to this:
Note: it is possible to disable MachineSetPreflightChecks, but if you are using kubeadm it may expose you to other issues.
Before implementing a solution please share a possible solution here for further discussion.
/help /triage accepted /priority backlog
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Note: it is possible to disable MachineSetPreflightChecks, but if you are using kubeadm it may expose you to other issues.
Yes. Of course, I should have mentioned that there are contexts where we need to keep the benefit of MachineSetPreflightChecks and still solve the challenge described in this issue.
Before implementing a solution please share a possible solution here for further discussion.
The solution I have in mind (bird's-eye view) would be to have the MachineDeployment controller downscale an old, not-up-to-date MachineSet only if the current MachineSet that would be scaled up to compensate is passing MachineSetPreflightChecks.
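A minimal sketch of that gating decision, assuming maxSurge 0 (all types and function names here are illustrative and do not match the real cluster-api controller code; `preflightChecksPass` stands in for the actual MachineSetPreflightChecks evaluation, which in reality consults control plane state such as version skew):

```go
package main

import "fmt"

// MachineSet is a simplified stand-in for the CAPI MachineSet object.
type MachineSet struct {
	Name           string
	Replicas       int
	PreflightsPass bool // would be computed against the control plane in the real controller
}

// preflightChecksPass is a placeholder for the real MachineSetPreflightChecks
// evaluation (e.g. KubeadmVersionSkew, ControlPlaneIsStable).
func preflightChecksPass(ms *MachineSet) bool {
	return ms.PreflightsPass
}

// mayScaleDownOldMachineSet encodes the proposed rule: with maxSurge=0,
// only downscale the old MachineSet if the new MachineSet that would be
// scaled up to compensate is currently passing its preflight checks.
func mayScaleDownOldMachineSet(newMS *MachineSet) bool {
	return preflightChecksPass(newMS)
}

func main() {
	newMS := &MachineSet{Name: "md-1-new", PreflightsPass: false}
	fmt.Println(mayScaleDownOldMachineSet(newMS)) // false: CP rollout in flight, hold the old MachineSet
	newMS.PreflightsPass = true
	fmt.Println(mayScaleDownOldMachineSet(newMS)) // true: CP rollout done, proceed with the MD rollout
}
```

The effect is that the MD stays at full replica count until the moment the replacement machine could really be created, shrinking the downscaled window to just the MD rollout itself.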
FYI, we are going to change the machinedeployment controller to account for in-place.
I'm experimenting in https://github.com/fabriziopandini/cluster-api/tree/chained-upgrades. When this work merges, it will also provide unit tests that will allow your proposed changes to be considered with confidence.