machine-controller-manager
[Extensions] Improve maxSurge/maxUnavailable handling for multi-zone worker pools
How to categorize this issue?
/area usability /kind enhancement /priority 3
What would you like to be added:
Today, `maxSurge` and `maxUnavailable` values are configured at the worker pool level (ref). Provider extensions usually distribute the configured values if multiple zones are configured (ref).
Although distributing these numbers is generally acceptable, it is unclear to end-users and can thus result in unacceptable and unexpected cluster upgrade behavior. This is especially true when `maxSurge < len(zones)`, and even more so when additionally `maxUnavailable < maxSurge`.
Example:
```yaml
workers:
- name: worker
  machine:
    type: n1-standard-4
    image:
      name: gardenlinux
      version: 318.8.0
  maximum: 5
  minimum: 3
  maxSurge: 1
  maxUnavailable: 0
  zones:
  - europe-west1-a
  - europe-west1-b
  - europe-west1-c
```
This will result in 3 `MachineDeployment`s (a simplified sketch of this distribution follows the table):

| MachineDeployment | Zone | maxSurge | maxUnavailable |
|---|---|---|---|
| worker-z1 | europe-west1-a | 1 | 0 |
| worker-z2 | europe-west1-b | 0 | 0 |
| worker-z3 | europe-west1-c | 0 | 0 |
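For illustration, here is a minimal sketch of this kind of zone distribution (an assumption about the splitting logic, not the actual provider-extension code), which reproduces the table above:

```go
package main

import "fmt"

// distributeOverZones sketches how a pool-level value (maxSurge, maxUnavailable,
// minimum, maximum) could be split across zones: every zone gets the integer
// share, and the remainder goes to the first zones. Illustration only, not the
// real extensions-library implementation.
func distributeOverZones(zoneIndex, value, zoneCount int) int {
	share := value / zoneCount
	if zoneIndex < value%zoneCount {
		share++ // hand out the remainder to the first zones
	}
	return share
}

func main() {
	zones := []string{"europe-west1-a", "europe-west1-b", "europe-west1-c"}
	maxSurge, maxUnavailable := 1, 0
	for i, zone := range zones {
		fmt.Printf("worker-z%d (%s): maxSurge=%d maxUnavailable=%d\n",
			i+1, zone,
			distributeOverZones(i, maxSurge, len(zones)),
			distributeOverZones(i, maxUnavailable, len(zones)))
	}
	// Output:
	// worker-z1 (europe-west1-a): maxSurge=1 maxUnavailable=0
	// worker-z2 (europe-west1-b): maxSurge=0 maxUnavailable=0
	// worker-z3 (europe-west1-c): maxSurge=0 maxUnavailable=0
}
```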
While the workers in `europe-west1-a` are upgraded in a rolling fashion, the ones in `europe-west1-b` and `europe-west1-c` are just replaced. During the upgrade procedure, the cluster will have fewer `Node`s than configured in `workers[*].minimum`.
We see the following options to improve this user experience (only when `maxSurge < len(zones)`):
- Change API validation so that `maxSurge >= len(zones)` --> incompatible and will probably break many automation functionalities around Gardener.
- Automatically set `maxSurge: 1` for each zone (suggested by @AxiomSamarth @himanshu-kun) --> solves many "standard" cases in which `maxUnavailable` is not used.
- The worker actuator sets the configured values zone by zone when an upgrade is performed --> comes close to what is expected by end-users but implies long-running worker reconciliations.
- Other thoughts?
Why is this needed: For a better user experience and to avoid unexpected outages.
> Other thoughts?
@timuthy `maxSurge = ceil(maxSurge / numZones)` and `maxUnavailable = ceil(maxUnavailable / numZones)`, i.e. a `maxSurge` of `1` with `3` zones turns into `ceil(1/3) = 1`. This also generally works for larger numbers, though that may not be as important, if at all (but it is more reasonable and less magical).
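As a rough illustration of the ceil-based idea (a sketch only; the helper is made up, not existing code):

```go
package main

import "fmt"

// ceilPerZone sketches the suggested ceil-based approach: every zone gets
// ceil(value / numZones) instead of a split share of the pool-level value.
// Hypothetical helper, for illustration only.
func ceilPerZone(value, numZones int) int {
	return (value + numZones - 1) / numZones
}

func main() {
	fmt.Println(ceilPerZone(1, 3)) // 1 -> a maxSurge of 1 with 3 zones stays 1 per zone
	fmt.Println(ceilPerZone(5, 3)) // 2 -> larger values also come out reasonably
}
```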
The main point is that we neglected this problem from the start. It's not only about `maxSurge`/`maxUnavailable`; it's also about `minimum`/`maximum`. Those should also better use `ceil()` or, even better, the configuration should be "the same per zone", just like with GKE, for instance. There, if you add a new pool, you specify how many nodes per zone are added (one number times the number of zones results in the overall number of required nodes, which is therefore always an integer multiple of the per-zone number), while we, on the other hand, interpret the number as covering all zones and then divide by the number of zones, and thereby have to deal with the division remainder / leave the realm of integer numbers ;-).
So the next thought is to give up on "hey, it's logical, everything in the configuration per worker pool is meant for the entirety of the machines" (the instance type, the OS, the volume, the container runtime, the taints/labels/annotations, the `kubelet` configuration). And to be honest, it isn't that "logical" either (which all adds to the end-user confusion). For instance, `dataVolumes` is not divided by the number of nodes/zones, but applies per node. So (unless we want to allow people to tweak the configuration even per zone; there is probably no use case for that except in very narrow corner cases where some instance types are "out of stock" in particular zones), we could also have new dedicated properties, e.g. `minMachinesPerZone`, `maxMachinesPerZone`, `maxSurgeMachinesPerZone`, `maxUnavailableMachinesPerZone`, which would finally end the ambiguity.
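For illustration only, such dedicated properties might look roughly like this in the worker API types (hypothetical package and field names taken from the proposal above; none of them exist in the Gardener API today):

```go
// Package gardenapi is a hypothetical placeholder used only for this sketch.
package gardenapi

// WorkerPerZone sketches the proposed dedicated per-zone properties. Because
// the semantics are explicitly "per zone", no division across zones (and no
// remainder handling) would be needed anymore.
type WorkerPerZone struct {
	MinMachinesPerZone            int32 `json:"minMachinesPerZone"`
	MaxMachinesPerZone            int32 `json:"maxMachinesPerZone"`
	MaxSurgeMachinesPerZone       int32 `json:"maxSurgeMachinesPerZone"`
	MaxUnavailableMachinesPerZone int32 `json:"maxUnavailableMachinesPerZone"`
}
```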
@dguendisch, @mvladev, @ialidzhikov, @plkokanov, @timuthy, @vpnachev, @schrodit, @danielfoehrkn, @beckermax, @rfranzke, @timebertt, @hendrikkahl, @kris94, @voelzmo, @stoyanr
This issue was referenced by @vlerenc in duplicate issue gardener/autoscaler#104.
The Gardener project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:
- Mark this issue or PR as fresh with `/remove-lifecycle stale`
- Mark this issue or PR as rotten with `/lifecycle rotten`
- Close this issue or PR with `/close`

/lifecycle stale
The Gardener project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:
- Mark this issue or PR as fresh with `/remove-lifecycle rotten`
- Close this issue or PR with `/close`

/lifecycle rotten
The Gardener project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:
- Reopen this issue or PR with `/reopen`
- Mark this issue or PR as fresh with `/remove-lifecycle rotten`

/close
@gardener-ci-robot: Closing this issue.
In response to this:
/close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/remove-lifecycle rotten
/reopen
@unmarshall: Reopened this issue.
In response to this:
/reopen
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/lifecycle stale
/remove-lifecycle stale
/assign
/lifecycle stale
GKE upgrade process -> https://cloud.google.com/kubernetes-engine/docs/concepts/node-pool-upgrade-strategies
They support blue-green upgrades, but we don't.
They upgrade one zone at a time, with their parameters applying only to the zone currently being upgraded, while our upgrade works in parallel over all zones but splits our parameters across zones; the current issue was opened for one such case of splitting.
Both approaches have their pros and cons.
They also default the configuration to `maxSurge=1, maxUnavailable=0` for their worker pool if it is not provided. One of the solutions recommended in the issue description suggests something similar.
Pros of one zone at a time (a rough sketch of such a staged rollout follows this list):
- We ensure a failing rolling update doesn't impact all the zones together (staged zone rollout).
- A pod de-scheduled in one zone due to the rolling update won't go to another zone, which sometimes happens with a parallel rolling update.
- There is a clear demarcation of when to move to the next zone and roll it: either on completion of the last one, or if the last one didn't progress for `progressDeadlineSeconds`.
- There is no need to distribute maxSurge/maxUnavailable amongst all zones, as they apply only to the zone for which an update is in progress.
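As referenced above, here is a heavily simplified sketch of such a one-zone-at-a-time rollout loop (`rollZone` is a made-up placeholder for the real upgrade machinery, not an MCM function):

```go
package main

import "fmt"

// rollZone stands in for the actual per-zone upgrade: scale up by maxSurge,
// replace machines while respecting maxUnavailable, and watch progress against
// progressDeadlineSeconds. Placeholder only.
func rollZone(zone string, maxSurge, maxUnavailable int) error {
	fmt.Printf("rolling %s with maxSurge=%d maxUnavailable=%d\n", zone, maxSurge, maxUnavailable)
	return nil
}

func main() {
	// One zone at a time: the pool-level values apply fully to the zone being
	// rolled, and a failure stops the staged rollout before it spreads.
	for _, zone := range []string{"europe-west1-a", "europe-west1-b", "europe-west1-c"} {
		if err := rollZone(zone, 1, 0); err != nil {
			fmt.Printf("stopping staged rollout, zone %s did not progress: %v\n", zone, err)
			return
		}
	}
}
```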
> The main point is that we neglected this problem from the start. It's not only about maxSurge/maxUnavailable. It's also about minimum/maximum. Also those should better use ceil() or, even better yet, the configuration should be "the same per zone".
How `maximum` is currently calculated can cause unexpected node termination when adding zones. For example, if you have 12 workers in zone A and a `maximum` of 30, then adding two more zones means that the new maximum for zone A will be 10 workers (30 divided over 3 zones). This will lead to node termination in zone A.
This is something an operator/shoot owner can easily be surprised by, and it can impact workloads. Therefore, I vote to change that. Ideas would be to specify the maximum per zone, or to let the maximum be the upper limit for the sum of the nodes in all zones.
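To make the arithmetic explicit, here is a tiny sketch of the scenario above (assuming the simple per-zone split; not the actual implementation):

```go
package main

import "fmt"

func main() {
	// Pool-level maximum of 30, 12 existing workers in zone A, pool extended
	// from 1 to 3 zones. The split below mirrors the "divide by zone count"
	// behaviour (30 % 3 == 0, so remainders play no role here).
	poolMaximum := 30
	existingInZoneA := 12
	zoneCount := 3

	perZoneMaximum := poolMaximum / zoneCount // 10
	fmt.Printf("new maximum for zone A: %d\n", perZoneMaximum)
	if existingInZoneA > perZoneMaximum {
		fmt.Printf("%d node(s) in zone A would be terminated\n", existingInZoneA-perZoneMaximum)
	}
	// Output:
	// new maximum for zone A: 10
	// 2 node(s) in zone A would be terminated
}
```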
We see some potential to improve the API with regard to the worker configuration. I opened https://github.com/gardener/gardener/issues/8142 for further discussion.
:warning: I feel that https://github.com/gardener/machine-controller-manager/issues/798#issuecomment-1599149933 from @BeckerMax needs more urgent attention and should be tracked in a separate issue to expedite its handling, rather than waiting for the perfect design to achieve the ask of allowing per-zone configuration.
I did a simple experiment to check and found that this will lead to rolling of machines/pods in the current setup, and it may go unnoticed if the current max can accommodate the additional machines required per zone.
Expand to see the experiment and the findings.
- Created a shoot with a worker pool of machine type m5.large (2 CPU / 8 GB) with min=2, max=3, maxSurge=1.
- Created an nginx-deployment with replicas=2 and cpuRequest=1000m and deployed it on the shoot.
- Checking the running instances, we can see 3 nodes and 2 pods running.
```
❯ kgno
NAME                                          STATUS   ROLES    AGE   VERSION
ip-10-180-10-43.eu-west-1.compute.internal    Ready    <none>   12m   v1.26.5
ip-10-180-14-176.eu-west-1.compute.internal   Ready    <none>   10m   v1.26.5
ip-10-180-23-35.eu-west-1.compute.internal    Ready    <none>   12m   v1.26.5

❯ kgp -o wide
NAME                                READY   STATUS    RESTARTS   AGE   IP           NODE                                          NOMINATED NODE   READINESS GATES
nginx-deployment-85996f8dbd-5bnt8   1/1     Running   0          89s   100.64.1.9   ip-10-180-10-43.eu-west-1.compute.internal    <none>           <none>
nginx-deployment-85996f8dbd-nwwb2   1/1     Running   0          89s   100.64.2.8   ip-10-180-14-176.eu-west-1.compute.internal   <none>           <none>
```
- Edited the worker pool by just adding 2 additional zones, without changing the min/max/surge values. We can see that the old running nodes are now replaced, new nodes are created in the 2 new zones, and the pods are moved to the new nodes.
```
❯ kgno
NAME                                          STATUS   ROLES    AGE   VERSION
ip-10-180-131-9.eu-west-1.compute.internal    Ready    <none>   49s   v1.26.5
ip-10-180-14-176.eu-west-1.compute.internal   Ready    <none>   32m   v1.26.5
ip-10-180-88-220.eu-west-1.compute.internal   Ready    <none>   89s   v1.26.5

❯ kgp -o wide   # done after I started writing this comment, so it has been running for some time now
NAME                                READY   STATUS    RESTARTS   AGE   IP           NODE                                         NOMINATED NODE   READINESS GATES
nginx-deployment-85996f8dbd-4m2nv   1/1     Running   0          35m   100.64.4.2   ip-10-180-131-9.eu-west-1.compute.internal   <none>           <none>
nginx-deployment-85996f8dbd-g8qrt   1/1     Running   0          37m   100.64.3.2   ip-10-180-88-220.eu-west-1.compute.internal  <none>           <none>
```
- In this case we had enough capacity, with the max allowing the machines needed to handle the change. However, if we had had less spare capacity before the new zones were added, we could also have ended up with workloads stuck in Pending state, which is bad for the end users: their previously running workloads can no longer be scheduled unless they reach out to operators for help.
- The risk becomes more pertinent when running stateful workloads, as that might lead to downtime for the service.
I would propose:
- If possible, detect the currently active machines on a single-zone cluster.
- Validate that, when making the worker pool multi-zonal, the new min value of the per-zone MachineDeployment for the existing zone is not less than the currently running machines.
- If that is not the case, then, if the max allows it, MCM should ensure that the min of the new per-zone MachineDeployments is >= the running machines in the existing zones.
- Otherwise, if this can't be achieved in an automated way, we should either fail the shoot.yaml validation or fail the shoot reconciliation to ensure this change is not applied.
We can also enhance the current dashboard warning by either proposing the min values or informing the user of the consequences. A rough sketch of such a validation check is shown below.
wdyt?
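As referenced above, a sketch of the proposed validation (function name, signature, and the simplified per-zone split are all assumptions for illustration, not existing MCM/Gardener code):

```go
package main

import "fmt"

// validateZoneExtension checks whether extending a single-zone worker pool to
// newZoneCount zones would shrink the existing zone below its currently running
// machines, and whether the pool maximum would allow an automatic adjustment.
func validateZoneExtension(runningInExistingZone, poolMinimum, poolMaximum, newZoneCount int) error {
	newPerZoneMin := poolMinimum / newZoneCount // simplified split, remainders ignored
	if runningInExistingZone <= newPerZoneMin {
		return nil // nothing would be scaled down in the existing zone
	}
	// Could the per-zone minimum be raised automatically without exceeding the pool maximum?
	if runningInExistingZone*newZoneCount <= poolMaximum {
		fmt.Printf("auto-adjust: raise per-zone minimum to %d\n", runningInExistingZone)
		return nil
	}
	return fmt.Errorf("adding zones would shrink the existing zone from %d to a guaranteed %d machine(s); adjust minimum/maximum or accept the disruption",
		runningInExistingZone, newPerZoneMin)
}

func main() {
	// Numbers from the experiment above: 3 running machines in the single zone,
	// min=2, max=3, extended to 3 zones.
	if err := validateZoneExtension(3, 2, 3, 3); err != nil {
		fmt.Println("validation error:", err)
	}
}
```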
FYI: Proposal https://gist.github.tools.sap/D043832/f5d0ac0e0bb138eea07c84ac79e10ce9#step-1-enhancing-zone-distribution-and-surge-management (only step 1 relevant here)