
[Extensions] Improve maxSurge/maxUnavailable handling for multi-zone worker pools

Open timuthy opened this issue 3 years ago • 19 comments

How to categorize this issue?

/area usability /kind enhancement /priority 3

What would you like to be added: Today, maxSurge and maxUnavailable values are configured at the worker pool level (ref). Provider extensions usually distribute the configured values if multiple zones are configured (ref).

Although distributing these numbers is generally acceptable, it is often unclear to end-users and can thus result in unacceptable and unexpected cluster upgrade behavior. This is especially true when maxSurge < len(zones), and even more so when maxSurge < len(zones) && maxUnavailable < maxSurge.

Example:

    workers:
    - name: worker
      machine:
        type: n1-standard-4
        image:
          name: gardenlinux
          version: 318.8.0
      maximum: 5
      minimum: 3
      maxSurge: 1
      maxUnavailable: 0
      zones:
      - europe-west1-a
      - europe-west1-b
      - europe-west1-c

This will result in 3 MachineDeployments:

MachineDeployment   Zone             maxSurge   maxUnavailable
worker-z1           europe-west1-a   1          0
worker-z2           europe-west1-b   0          0
worker-z3           europe-west1-c   0          0

While the workers in europe-west1-a are upgraded in a rolling fashion, the ones in europe-west1-b and europe-west1-c are simply replaced. During the upgrade procedure, the cluster will have fewer Nodes than configured in workers[*].minimum.
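
For illustration, here is a minimal Go sketch of such a remainder-based split (the helper below is an assumption for illustration, not the actual Gardener code); it reproduces the per-zone values shown in the table above:

    package main

    import "fmt"

    // distributeOverZones is an illustrative sketch (not the actual Gardener helper)
    // of the remainder-based split: each zone receives the integer share of the
    // pool-level value, and the remainder is assigned to the first zones only.
    func distributeOverZones(zoneIndex, value, zoneCount int) int {
        share := value / zoneCount
        if zoneIndex < value%zoneCount {
            share++
        }
        return share
    }

    func main() {
        zones := []string{"europe-west1-a", "europe-west1-b", "europe-west1-c"}
        maxSurge, maxUnavailable := 1, 0
        for i, zone := range zones {
            fmt.Printf("worker-z%d %s maxSurge=%d maxUnavailable=%d\n", i+1, zone,
                distributeOverZones(i, maxSurge, len(zones)),
                distributeOverZones(i, maxUnavailable, len(zones)))
        }
    }

With maxSurge=1 and maxUnavailable=0, only the first zone receives a surge budget, which is exactly the situation described above.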

We see the following options to improve this user experience (only when maxSurge < len(zones)):

  • Change API validation so that maxSurge >= len(zones) --> incompatible and would probably break many automation functionalities around Gardener.
  • Automatically set maxSurge: 1 for each zone (suggested by @AxiomSamarth @himanshu-kun) --> solves many "standard" cases in which maxUnavailable is not used.
  • The worker actuator sets the configured values zone by zone when an upgrade is performed --> comes closest to what end-users expect but implies long-running worker reconciliations.
  • Other thoughts?

Why is this needed: Needed for better user experience to avoid unexpected outages.

timuthy avatar Oct 25 '21 14:10 timuthy

Other thoughts?

@timuthy maxSurge = ceil(maxSurge / numZones) and maxUnavailable = ceil(maxUnavailable / numZones), i.e. a maxSurge of 1 with 3 zones turns into ceil(1/3) = 1 per zone. It also generally works for larger numbers, though that may not be as important, if at all (but it is more reasonable and less magical).
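
A minimal Go sketch of this ceil-based split, assuming a plain integer ceiling (illustrative only, not existing Gardener code):

    package main

    import "fmt"

    // ceilPerZone gives every zone the same share, rounded up, instead of
    // assigning the division remainder only to the first zones.
    func ceilPerZone(value, zoneCount int) int {
        return (value + zoneCount - 1) / zoneCount
    }

    func main() {
        fmt.Println(ceilPerZone(1, 3)) // maxSurge 1 with 3 zones  -> 1 per zone
        fmt.Println(ceilPerZone(5, 3)) // maxSurge 5 with 3 zones  -> 2 per zone
        fmt.Println(ceilPerZone(0, 3)) // maxUnavailable 0         -> 0 per zone
    }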

The main point is that we neglected this problem from the start. It's not only about maxSurge/maxUnavailable. It's also about minimum/maximum. Also those should better use ceil() or, even better yet, the configuration should be "the same per zone". Just like with GKE, for instance: there, if you add a new pool, you specify how many nodes per zone are added (one number times the number of zones results in the overall number of required nodes, which is therefore always an integer multiple of the number per zone), while we, on the other hand, interpret the number as the total for all zones and then divide by the number of zones, and thereby have to deal with the division remainder/leave the realm of integer numbers ;-).

So the next thought is to give up on "hey, it's logical, everything in the configuration per worker pool is meant for the entirety of the machines" (the instance type, the OS, the volume, the container runtime, the taints/labels/annotations, the kubelet configuration). And to be honest, it isn't that "logical" either (which all adds to the end-user confusion). For instance, dataVolumes is not divided by the number of nodes/zones, but applies per node. So (unless we want to allow people to tweak the configuration even per zone; probably no use case for that except in very narrow corner cases where some instance types are "out of stock" in particular zones), we could also introduce new dedicated properties, e.g. minMachinesPerZone, maxMachinesPerZone, maxSurgeMachinesPerZone, maxUnavailableMachinesPerZone, which would finally end the ambiguity.
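
A hypothetical sketch of these dedicated per-zone properties as Go API fields; neither the type nor the field names exist in the Gardener API today, they only illustrate the proposal:

    // Package and type are hypothetical; this is not part of the Gardener worker API.
    package core

    // WorkerZoneSettings illustrates the proposed per-zone properties.
    type WorkerZoneSettings struct {
        // MinMachinesPerZone is the minimum number of machines to keep in each zone.
        MinMachinesPerZone int32 `json:"minMachinesPerZone"`
        // MaxMachinesPerZone is the maximum number of machines allowed in each zone.
        MaxMachinesPerZone int32 `json:"maxMachinesPerZone"`
        // MaxSurgeMachinesPerZone is how many machines may be created above the
        // desired count of a zone during a rolling update.
        MaxSurgeMachinesPerZone int32 `json:"maxSurgeMachinesPerZone"`
        // MaxUnavailableMachinesPerZone is how many machines of a zone may be
        // unavailable during a rolling update.
        MaxUnavailableMachinesPerZone int32 `json:"maxUnavailableMachinesPerZone"`
    }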

vlerenc avatar Oct 25 '21 15:10 vlerenc

@dguendisch, @mvladev, @ialidzhikov, @plkokanov, @timuthy, @vpnachev, @schrodit, @danielfoehrkn, @beckermax, @rfranzke, @timebertt, @hendrikkahl, @kris94, @voelzmo, @stoyanr This issue was referenced by @vlerenc in duplicate issue gardener/autoscaler#104.

gardener-robot avatar Dec 13 '21 09:12 gardener-robot

The Gardener project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close

/lifecycle stale

gardener-ci-robot avatar Mar 30 '22 14:03 gardener-ci-robot

The Gardener project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close

/lifecycle rotten

gardener-ci-robot avatar Apr 29 '22 14:04 gardener-ci-robot

The Gardener project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten

/close

gardener-ci-robot avatar May 29 '22 14:05 gardener-ci-robot

@gardener-ci-robot: Closing this issue.

In response to this:

The Gardener project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

gardener-prow[bot] avatar May 29 '22 14:05 gardener-prow[bot]

/remove-lifecycle rotten

unmarshall avatar Aug 17 '22 06:08 unmarshall

/reopen

unmarshall avatar Aug 17 '22 06:08 unmarshall

@unmarshall: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

gardener-prow[bot] avatar Aug 17 '22 06:08 gardener-prow[bot]

The Gardener project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close

/lifecycle stale

gardener-ci-robot avatar Nov 15 '22 06:11 gardener-ci-robot

/remove-lifecycle stale

unmarshall avatar Nov 15 '22 06:11 unmarshall

/assign

unmarshall avatar Nov 15 '22 06:11 unmarshall

The Gardener project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close

/lifecycle stale

gardener-ci-robot avatar Feb 13 '23 07:02 gardener-ci-robot

GKE upgrade process -> https://cloud.google.com/kubernetes-engine/docs/concepts/node-pool-upgrade-strategies

They support blue-green upgrades, but we don't. They upgrade one zone at a time, with their parameters applying to only that zone, while our upgrade works in parallel over all zones, with our parameters split across the zones; the current issue was opened for one such splitting case. Both approaches have their pros and cons. They also default the configuration to maxSurge=1, maxUnavailable=0 for their worker pool if not provided. One of the solutions recommended above (see issue description) suggested something similar.

himanshu-kun avatar Feb 27 '23 07:02 himanshu-kun

Pros of one zone at a time:

  • we ensure a failing rolling update doesn't impact all the zones together (staged zone rollout)
  • a pod de-scheduled in one zone due to the rolling update won't go to another zone, which sometimes happens in a parallel rolling update
  • there is a clear demarcation of when to move to a new zone and roll it: either on completion of the last one, or if the last one didn't progress for progressDeadlineSeconds
  • There is no need to distribute the maxSurge/maxUnavailable amongst all zones, as they apply only to the zone for which an update is in progress (a rough sketch of this flow follows below).
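
A rough Go sketch of such a zone-by-zone rollout loop; the zoneDeployment interface and all of its methods are hypothetical placeholders, not MCM APIs:

    package rollout

    import (
        "fmt"
        "time"
    )

    // zoneDeployment is a hypothetical abstraction over a per-zone MachineDeployment.
    type zoneDeployment interface {
        Name() string
        StartRollingUpdate() error
        IsFullyRolled() bool
        ExceededProgressDeadline(d time.Duration) bool
    }

    // rollZonesSequentially sketches the "one zone at a time" approach: the full
    // maxSurge/maxUnavailable budget applies only to the zone currently being rolled,
    // and the next zone starts only after the previous one completed or timed out.
    func rollZonesSequentially(zones []zoneDeployment, progressDeadline time.Duration) error {
        for _, z := range zones {
            if err := z.StartRollingUpdate(); err != nil {
                return err
            }
            for !z.IsFullyRolled() {
                if z.ExceededProgressDeadline(progressDeadline) {
                    return fmt.Errorf("zone %s did not progress within %s", z.Name(), progressDeadline)
                }
                time.Sleep(30 * time.Second)
            }
        }
        return nil
    }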

himanshu-kun avatar Mar 02 '23 09:03 himanshu-kun

The main point is that we neglected this problem from the start. It's not only about maxSurge/maxUnavailable. It's also about minimum/maximum. Also those should better use ceil() or, even better yet, the configuration should be "the same per zone".

How maximum is currently calculated can cause unexpected node termination when adding zones. For example, if you have 12 workers in zone A and a maximum of 30, then adding two more zones means that the new maximum for zone A will be 10 workers. This will lead to node termination in zone A. This is something an operator/shoot owner can easily be surprised by, and it can impact workloads. Therefore, I vote to change that. Ideas would be to specify the maximum per zone or to let the maximum be the upper limit for the sum of the nodes across all zones.
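
A small worked example of that arithmetic in Go, assuming a plain integer split of the pool maximum across zones (illustrative only, not the exact provider code):

    package main

    import "fmt"

    func main() {
        // Example from above: 12 nodes currently running in zone A, pool maximum of 30.
        currentNodesZoneA := 12
        poolMaximum := 30

        // With a single zone, the full pool maximum applies to zone A.
        fmt.Println("max for zone A with 1 zone:", poolMaximum) // 30

        // After adding two zones, the pool maximum is divided across 3 zones.
        zones := 3
        newMaxZoneA := poolMaximum / zones // 10
        fmt.Println("max for zone A with 3 zones:", newMaxZoneA)

        if currentNodesZoneA > newMaxZoneA {
            fmt.Printf("zone A exceeds its new maximum by %d nodes -> they will be terminated\n",
                currentNodesZoneA-newMaxZoneA)
        }
    }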

BeckerMax avatar Jun 20 '23 16:06 BeckerMax

We see some potential to improve the API with regard to the worker configuration. I opened https://github.com/gardener/gardener/issues/8142 for further discussion.

timuthy avatar Jun 22 '23 11:06 timuthy

:warning: I feel the https://github.com/gardener/machine-controller-manager/issues/798#issuecomment-1599149933 from @BeckerMax needs more urgent attention and should be tracked in a separate issue to expedite its handling, rather than waiting for the perfect design to achieve the ask of allowing per-zone configuration.

I did a simple experiment to check this and found that it will lead to rolling of machines/pods in the current setup, and it may go unnoticed if the current max can accommodate the additional machines required per zone.

Expand to see the experiment and the findings.
  1. Created a shoot with worker pool of machine type m5.large(2cpu/8gb) with min=2, max=3, maxSurge=1
  2. Created an nginx-deployment with replicas=2 and cpuRequest=1000m and deployed it on the shoot.
  3. Checking the running instances, we can see 3 nodes and 2 pods running.
    ❯ kgno
    NAME                                          STATUS   ROLES    AGE   VERSION
    ip-10-180-10-43.eu-west-1.compute.internal    Ready    <none>   12m   v1.26.5
    ip-10-180-14-176.eu-west-1.compute.internal   Ready    <none>   10m   v1.26.5
    ip-10-180-23-35.eu-west-1.compute.internal    Ready    <none>   12m   v1.26.5
    ❯ kgp -o wide
    NAME                                READY   STATUS    RESTARTS   AGE   IP           NODE                                          NOMINATED NODE   READINESS GATES
    nginx-deployment-85996f8dbd-5bnt8   1/1     Running   0          89s   100.64.1.9   ip-10-180-10-43.eu-west-1.compute.internal    <none>           <none>
    nginx-deployment-85996f8dbd-nwwb2   1/1     Running   0          89s   100.64.2.8   ip-10-180-14-176.eu-west-1.compute.internal   <none>           <none>  
    
  4. Edited the worker-pool by just adding 2 additional zones, without changing the min/max/surge values. We can see that the old running nodes are replaced by new nodes created in the 2 new zones, which also leads to the pods being moved to the new nodes.
     ❯ kgno
     NAME                                          STATUS   ROLES    AGE   VERSION
     ip-10-180-131-9.eu-west-1.compute.internal    Ready    <none>   49s   v1.26.5
     ip-10-180-14-176.eu-west-1.compute.internal   Ready    <none>   32m   v1.26.5
     ip-10-180-88-220.eu-west-1.compute.internal   Ready    <none>   89s   v1.26.5
     ❯ kgp -o wide  # done after I started writing this comment, so it has been running for some time now.
     NAME                                READY   STATUS    RESTARTS   AGE   IP           NODE                                          NOMINATED NODE   READINESS GATES
     nginx-deployment-85996f8dbd-4m2nv   1/1     Running   0          35m   100.64.4.2   ip-10-180-131-9.eu-west-1.compute.internal    <none>           <none>
     nginx-deployment-85996f8dbd-g8qrt   1/1     Running   0          37m   100.64.3.2   ip-10-180-88-220.eu-west-1.compute.internal   <none>           <none>
    
  5. In this case the max allowed enough machines to absorb the change. However, if we had had less capacity before the new zones were added, we could also have ended up with these workloads stuck in Pending state, which is bad for the end user: their previously running workloads would no longer be schedulable unless they reach out to operators for help.
  6. The risk becomes even more pertinent when running stateful workloads, as that might lead to downtime for the service.

I would propose:

  • If possible, detect the currently active machines on a single-zone cluster.
  • Validate that, when making the worker-pool multi-zonal, the new min value of the per-zone MachineDeployment for the existing zone is not less than the currently running machines.
  • If that is not the case, then, if the max allows it, MCM should ensure that the min of the new per-zone MachineDeployments is >= the running machines in the existing zones.
  • Otherwise, if this can't be achieved in an automated way, we should either fail the shoot.yaml validation or fail the shoot reconciliation to ensure this change is not applied (see the sketch after this list).
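
A hedged Go sketch of that validation, assuming a plain integer split for the new per-zone values; validateZoneExpansion is a hypothetical helper, not existing Gardener/MCM code:

    package main

    import "fmt"

    // validateZoneExpansion rejects a zone expansion if the new per-zone minimum
    // (and even the new per-zone maximum) for the existing zone would fall below
    // the machines currently running there.
    func validateZoneExpansion(currentMachinesExistingZone, poolMin, poolMax, newZoneCount int) error {
        newPerZoneMin := poolMin / newZoneCount // simplified, remainder-free split for illustration
        newPerZoneMax := poolMax / newZoneCount

        if newPerZoneMin >= currentMachinesExistingZone {
            return nil // no machines would be terminated in the existing zone
        }
        if newPerZoneMax >= currentMachinesExistingZone {
            return nil // the max allows lifting the new per-zone min automatically
        }
        return fmt.Errorf("per-zone min %d and max %d are below the %d machines currently running in the existing zone; refusing the change",
            newPerZoneMin, newPerZoneMax, currentMachinesExistingZone)
    }

    func main() {
        // Values from the experiment above: 3 running machines, min=2, max=3, expanding to 3 zones.
        fmt.Println(validateZoneExpansion(3, 2, 3, 3))
    }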

We can enhance the current dashboard warning by either proposing suitable min values or pointing the user to the consequences.

wdyt?

ashwani2k avatar Jun 30 '23 08:06 ashwani2k

FYI: Proposal https://gist.github.tools.sap/D043832/f5d0ac0e0bb138eea07c84ac79e10ce9#step-1-enhancing-zone-distribution-and-surge-management (only step 1 relevant here)

vlerenc avatar Jul 27 '24 19:07 vlerenc