☂️ [GEP-20] Highly Available Shoot Control Planes
How to categorize this issue?

/area high-availability
/kind enhancement
What would you like to be added:
This is an umbrella issue to track the implementation of GEP-20 Highly Available Shoot Control Planes.
- [x] https://github.com/gardener/gardener/pull/6530
  - Add validations for updating the `shoot.spec.controlPlanes` field
    - [x] Allow non-HA shoot -> HA shoot
    - [x] Only allow non-HA -> multi-zone if assigned seed is multi-zonal
    - [x] Single-zone HA shoot <-> multi-zone HA shoot must not be allowed
    - [x] HA shoot -> non-HA shoot must not be allowed (until etcd scale-down is implemented)
- [x] Enhance validations for shoot HA annotation
- [x] https://github.com/gardener/gardener/pull/6665
- [ ] https://github.com/gardener/gardener/pull/6723
- [x] https://github.com/gardener/gardener/pull/6579
- [ ] Enable multi-zonal deployments for more control plane components
- [x] https://github.com/gardener/gardener/pull/6674
- [ ] #6685
- [x] ~~Enhance Pod eviction in case of zone outage (delete Pods in `Terminating` state), see https://github.com/gardener/gardener/pull/6646#issuecomment-1243536333~~ (@ialidzhikov) -> see https://github.com/gardener/gardener/issues/6529#issuecomment-1254800255
- [ ] 🚧 Change the seed system component replicas based on the label or the spec for multi-zonal seeds (@timuthy)
- [ ] https://github.com/gardener/gardener/issues/6718
- [ ] Progress `HAControlPlanes` feature gate to beta
- [ ] Deprecate shoot HA annotation
/assign
/assign @timuthy
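The transition rules tracked above can be sketched as a small validation function. This is an illustrative sketch only: the constant values, function name, and parameters are assumptions for the example, not Gardener's actual admission code.

```go
package main

import "fmt"

// Illustrative failure-tolerance values; "" means no failureTolerance is set.
const (
	nonHA      = ""
	singleZone = "node"
	multiZone  = "zone"
)

// validateHATransition is a hypothetical sketch of the update validation
// rules listed in the checklist above; not the real Gardener implementation.
func validateHATransition(oldTolerance, newTolerance string, seedMultiZonal bool) error {
	switch {
	case oldTolerance == newTolerance:
		// No change in failure tolerance: always allowed.
		return nil
	case oldTolerance == nonHA && newTolerance == singleZone:
		// non-HA -> single-zone HA is allowed on any seed.
		return nil
	case oldTolerance == nonHA && newTolerance == multiZone:
		// non-HA -> multi-zone only if the assigned seed is multi-zonal.
		if !seedMultiZonal {
			return fmt.Errorf("non-HA -> multi-zone requires a multi-zonal seed")
		}
		return nil
	case newTolerance == nonHA:
		// Scaling HA back down is blocked until etcd scale-down exists.
		return fmt.Errorf("HA -> non-HA is not allowed until etcd scale-down is implemented")
	default:
		// Remaining cases are single-zone <-> multi-zone transitions.
		return fmt.Errorf("single-zone <-> multi-zone transitions are not allowed")
	}
}

func main() {
	fmt.Println(validateHATransition(nonHA, multiZone, true) == nil)
	fmt.Println(validateHATransition(singleZone, multiZone, true) == nil)
}
```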
- [ ] Introduce shoot spec field for enabling HA control planes
  - Add validations for updating the `shoot.spec.controlPlanes` field
    - [ ] Allow non-HA shoot -> HA shoot
    - [ ] Only allow non-HA -> multi-zone if assigned seed is multi-zonal
    - [ ] Single-zone HA shoot <-> multi-zone HA shoot must not be allowed
    - [ ] HA shoot -> non-HA shoot must not be allowed (until etcd scale-down is implemented)
This needs some modifications, along with a change required for the Seed. The change needs to be part of the GEP -> GEP enhancement PR | Implementation -> Review
We reprioritised in discussion with @timuthy
- [ ] Support zone-pinning for single-zone HA control planes (via GRM mutating webhook)
The bullet points below highlight the API contract for introducing HA control planes via:

```yaml
controlPlane:
  highAvailability:
    failureTolerance:
      type: <node|zone>
```
- `non-HA` shoots can be scheduled on `non-HA` or `HA` (multi-zone) seeds.
- `single-zone` shoots can be scheduled on `non-HA` or `HA` (multi-zone) seeds.
- `multi-zone` shoots can be scheduled ONLY on `HA` (multi-zone) seeds.
- `non-HA` shoots can be upgraded to `single-zone` on `non-HA` or `HA` seeds. **
- `non-HA` shoots can be upgraded to `multi-zone` only on `HA` seeds. **
- `single-zone` shoots shall not be allowed to upgrade to `multi-zone` shoots and shall be stopped by admission plugins.

** This can lead to a short disruption/downtime when the etcd StatefulSet is rolled.
Legend:
- `non-HA` shoot: any shoot which has no `failureTolerance` defined.
- `single-zone` shoot: any shoot which has `failureTolerance` defined with type `node`.
- `multi-zone` shoot: any shoot which has `failureTolerance` defined with type `zone`.
- `non-HA` seed: any seed whose worker pools for etcd/cp run only in a single availability zone.
- `HA` seed: any seed whose worker pools for etcd/cp are defined across 3 availability zones and which has the label `seed.gardener.cloud/multi-zonal: "true"`.
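The scheduling part of this contract reduces to a single predicate. The following is a minimal sketch under the assumption that a seed's HA capability is captured by a boolean (function and parameter names are made up for illustration):

```go
package main

import "fmt"

// canSchedule sketches the scheduling contract listed above: a multi-zone
// shoot requires an HA (multi-zone) seed, while non-HA and single-zone
// shoots fit on any seed. tolerance is "" (non-HA), "node" (single-zone),
// or "zone" (multi-zone). Illustrative only, not Gardener's scheduler code.
func canSchedule(tolerance string, seedMultiZonal bool) bool {
	if tolerance == "zone" {
		// multi-zone shoots can be scheduled ONLY on HA (multi-zone) seeds.
		return seedMultiZonal
	}
	// non-HA and single-zone shoots can go to non-HA or HA seeds.
	return true
}

func main() {
	fmt.Println(canSchedule("zone", false))
	fmt.Println(canSchedule("node", false))
}
```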
Regarding "Enhance Pod eviction in case of zone outage (delete Pods in `Terminating` state)": for Deployments, the kube-controller-manager behaviour is to create new Pods right away while the old Pods are `Terminating`.
Example:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      terminationGracePeriodSeconds: 60
      containers:
      - name: nginx
        image: centos
        command: ["/bin/sh"]
        args: ["-c", "sleep 3600"]
        ports:
        - containerPort: 80
```
Above we have a Deployment. Its container does not handle SIGTERM, so it will hang in `Terminating` until it is force-killed after `terminationGracePeriodSeconds`.
```
$ k get po
NAME                                READY   STATUS    RESTARTS   AGE
nginx-deployment-746759f465-z95lj   1/1     Running   0          11m

$ k delete po nginx-deployment-746759f465-z95lj
pod "nginx-deployment-746759f465-z95lj" deleted

$ k get po
NAME                                READY   STATUS              RESTARTS   AGE
nginx-deployment-746759f465-z95lj   1/1     Terminating         0          11m
nginx-deployment-746759f465-ztz65   0/1     ContainerCreating   0          2s

$ k get po
NAME                                READY   STATUS        RESTARTS   AGE
nginx-deployment-746759f465-z95lj   1/1     Terminating   0          12m
nginx-deployment-746759f465-ztz65   1/1     Running       0          54s
```
Above you can see that when the old replica is deleted, the new one is created right away.
I suspect that in the experiments of @unmarshall (ref https://github.com/gardener/gardener/pull/6287#discussion_r942324110) there was a webhook preventing creation of new Pods (for some unknown reason) or kube-controller-manager was down (for some unknown reason). These are the 2 potential things that could explain https://github.com/gardener/gardener/pull/6287#discussion_r942324110.
Anyway, I will try to simulate a zone outage and check why KCM does not create the new Pods when the old ones are terminating.
We had a sync with @unmarshall and we are able to confirm that in a simulation of zone outage (simulated via network acl that denies all ingress and egress traffic for a zone) the recovery for a (multi-zone) control plane worked well as outlined in https://github.com/gardener/gardener/issues/6529#issuecomment-1253517049:
- For Deployments kube-controller-manager creates new replicas right away when the old replicas are terminating. The new replicas start successfully on a healthy zone.
- I think during @unmarshall's simulations kube-controller-manager was down for some reason. I also revised the webhooks we deploy and whether we could have a deadlock situation that could block new Pod creation but I didn't see anything abnormal.
PS: We also found that the existing garbage-collector (shoot-care-controller of gardenlet) already deletes Terminating pods in the Shoot's control plane after 5min.
https://github.com/gardener/gardener/blob/24b667c39f4ff1c6e733ce98816e4f6127a2476f/pkg/operation/care/garbage_collection.go#L85-L93
But this is not a recovery mechanism and does not by itself lead to recovery. For Deployments, kube-controller-manager already creates the new replicas. For StatefulSets, even when the old `Terminating` replicas are forcefully deleted, this does not lead to a recovery because the new StatefulSet Pods fail to be scheduled - they have scheduling requirements that cannot be satisfied during the zone outage (the etcd Pod has to run in the outage zone, or the loki/prometheus Pods have to run in the outage zone because their volume is already provisioned in this zone).
TL;DR: We will resolve the corresponding item as completed as nothing has to be done. Let us know if you have additional comments on this topic. We have to update GEP-20 with the new learnings.
Great to hear! Thank you!
I added another item Support control-plane migration for HA shoots since this doesn't seem to work out of the box. We should create a separate issue once we have more certainty and details and find a proper way to support this use-case.
cc @plkokanov @vlerenc
Should we (for now) add validation that forbids migration for HA shoots?
/assign @rfranzke
for the tasks related to
Configure high-availability settings for {seed system, shoot control plane, shoot system} components
(except for the Remainders/special cases)
All related PRs in gardener/gardener have been merged.
/unassign
/assign @plkokanov @ishan16696 for tasks related to
Support control-plane migration for HA shoots
Please see the approaches possible to achieve CPM in multi-node etcd: https://github.com/gardener/etcd-druid/issues/479#issuecomment-1365793557
All tasks have been completed. /close
@rfranzke: Closing this issue.
In response to this:

> All tasks have been completed. /close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.