machine-controller-manager

Changing a Gardener Shoot worker group's name affects all running workloads and causes downtime.

bollmann opened this issue 3 years ago • 5 comments

How to categorize this issue? /area control-plane /kind enhancement /priority 3

What happened?

We accidentally changed the worker group name inside our productive Gardener shoot manifest from "edp-g-pe1" to "default". This caused all worker nodes belonging to the old worker group to be drained and shut down. Only afterwards were the nodes recreated and added to the worker group with the new name. This procedure caused a full outage of our system. We didn't expect an accidental worker group name change to have such severe consequences. More details about what happened to us are explained in this Post-Mortem.

What did you expect to happen?

We expected this worker group name change inside Gardener's shoot.yml not to cause any downtime. This is the expectation we have come to rely on when dealing with K8s CRDs. If, however, changing the worker group name inevitably means a full outage of the system running on top of it, it should not be this easy to misconfigure. Therefore, we wonder the following:

  1. Would it be possible to carry out a change of the worker group name without causing any downtime? For example, if the underlying shoot nodes didn't carry the worker group name as part of their machine names, would a worker group name change then be less destructive? Or, instead of first draining the old worker group's workers and then spawning them in the new worker group, wouldn't it be better to first create the workers in the new worker group and only then drain the workers from the old one?
  2. If option 1 is not possible, wouldn't it be more resilient to forbid such a dangerous, downtime-causing change of the worker group name? So far we've found it far too easy to change the worker group name in the Gardener shoot.yml and thereby temporarily destroy our productive Gardener cluster.
  3. Moreover, the current dangerous behavior of a worker group name change doesn't seem to be documented in the Core APIs. In our opinion, it should be.

How would we reproduce it (concisely and precisely)?

Create a K8s shoot cluster on AWS with a couple of worker nodes that belong to, say, worker group "foo". Then edit the shoot yaml and change the worker group name to "bar". Observe how all nodes of the old worker group "foo" get drained and deleted before being recreated as part of the new worker group "bar", and how this affects all workloads running inside the K8s cluster.
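For reference, the change in question is a single field in the shoot manifest. Below is a minimal, hypothetical excerpt of the worker pool section of a Shoot resource (field paths follow the Gardener Shoot API; the pool name and machine values are made up for illustration):

```yaml
# Hypothetical excerpt of a Gardener Shoot manifest (shoot.yml).
spec:
  provider:
    type: aws
    workers:
    - name: foo        # renaming this field to "bar" drains and deletes
      machine:         # every node of pool "foo" before new "bar" nodes
        type: m5.large # are created
      minimum: 3
      maximum: 3
```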

bollmann · May 12 '22 15:05

@bollmann Label area/control-plan does not exist.

gardener-robot · May 12 '22 15:05

@bollmann You have mentioned internal references in the public. Please check.

gardener-robot · May 12 '22 15:05

cc @vlerenc

himanshu-kun · May 24 '22 05:05

Thanks @bollmann. You are right, ideally we should prevent that (option #2). It is not reasonably possible to rename a pool (option #1), as the name is encoded everywhere, partly so that you can more easily tell what the nodes are doing (on the infrastructure provider side as well as from within Kubernetes). GKE doesn't offer pool renaming either.

@himanshu-kun @dkistner Can we implement a safeguard in the admission controller? I guess it will be difficult/impossible to distinguish the renaming of a pool from removing one and adding another that is identical except for the name? ;-)
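To illustrate that ambiguity, here is a sketch using the same hypothetical pool values as in the reproduction example above: the API object after a "rename" is identical to the one after deleting a pool and adding an otherwise identical one, so a webhook would have to guess the user's intent.

```yaml
# Old workers list: one pool named "foo".
workers:
- name: foo
  machine:
    type: m5.large
  minimum: 3
  maximum: 3
---
# New workers list: is this a rename of "foo", or a deletion of "foo" plus
# the addition of an otherwise identical pool "bar"? The resulting object is
# the same either way, so an admission check cannot tell the two apart.
workers:
- name: bar
  machine:
    type: m5.large
  minimum: 3
  maximum: 3
```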

vlerenc · May 24 '22 07:05