☂️ [GEP-20] Highly Available Shoot Control Planes
How to categorize this issue?

/area high-availability
/kind enhancement
What would you like to be added:
This is an umbrella issue to track the implementation of GEP-20 Highly Available Shoot Control Planes.
- [x] https://github.com/gardener/gardener/pull/6530
  - Add validations for updating the `shoot.spec.controlPlanes` field
    - [x] Allow non-HA shoot -> HA shoot
    - [x] Only allow non-HA -> multi-zone if assigned seed is multi-zonal
    - [x] Single-zone HA shoot <-> multi-zone HA shoot must not be allowed
    - [x] HA shoot -> non-HA shoot must not be allowed (until etcd scale-down is implemented)
- [x] Enhance validations for shoot HA annotation
- [x] https://github.com/gardener/gardener/pull/6665
- [ ] https://github.com/gardener/gardener/pull/6723
- [x] https://github.com/gardener/gardener/pull/6579
- [ ] Enable multi-zonal deployments for more control plane components
- [x] https://github.com/gardener/gardener/pull/6674
- [ ] #6685
- [x] ~~Enhance Pod eviction in case of zone outage (delete Pods in `Terminating` state), see https://github.com/gardener/gardener/pull/6646#issuecomment-1243536333~~ (@ialidzhikov) -> see https://github.com/gardener/gardener/issues/6529#issuecomment-1254800255
- [ ] 🚧 Change the seed system component replicas based on the label or the spec for multi-zonal seeds (@timuthy)
- [ ] https://github.com/gardener/gardener/issues/6718
- [ ] Progress `HAControlPlanes` feature gate to beta
- [ ] Deprecate shoot HA annotation
/assign
/assign @timuthy
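The transition rules tracked above can be sketched as a small validation function. This is an illustrative sketch only: the constant values, function name, and parameters are assumptions for the example, not Gardener's actual admission code.

```go
package main

import "fmt"

// Illustrative failure-tolerance values; "" means no failureTolerance is set.
const (
	nonHA      = ""
	singleZone = "node"
	multiZone  = "zone"
)

// validateHATransition is a hypothetical sketch of the update validation
// rules listed in the checklist above; not the real Gardener implementation.
func validateHATransition(oldTolerance, newTolerance string, seedMultiZonal bool) error {
	switch {
	case oldTolerance == newTolerance:
		// No change in failure tolerance: always allowed.
		return nil
	case oldTolerance == nonHA && newTolerance == singleZone:
		// non-HA -> single-zone HA is allowed on any seed.
		return nil
	case oldTolerance == nonHA && newTolerance == multiZone:
		// non-HA -> multi-zone only if the assigned seed is multi-zonal.
		if !seedMultiZonal {
			return fmt.Errorf("non-HA -> multi-zone requires a multi-zonal seed")
		}
		return nil
	case newTolerance == nonHA:
		// Scaling HA back down is blocked until etcd scale-down exists.
		return fmt.Errorf("HA -> non-HA is not allowed until etcd scale-down is implemented")
	default:
		// Remaining cases are single-zone <-> multi-zone transitions.
		return fmt.Errorf("single-zone <-> multi-zone transitions are not allowed")
	}
}

func main() {
	fmt.Println(validateHATransition(nonHA, multiZone, true) == nil)
	fmt.Println(validateHATransition(singleZone, multiZone, true) == nil)
}
```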
- [ ] Introduce shoot spec field for enabling HA control planes
  - Add validations for updating the `shoot.spec.controlPlanes` field
    - [ ] Allow non-HA shoot -> HA shoot
    - [ ] Only allow non-HA -> multi-zone if assigned seed is multi-zonal
    - [ ] Single-zone HA shoot <-> multi-zone HA shoot must not be allowed
    - [ ] HA shoot -> non-HA shoot must not be allowed (until etcd scale-down is implemented)
This needs some modifications, along with a change required for the Seed. The change needs to be part of the GEP -> GEP enhancement PR | Implementation -> Review
We reprioritised in discussion with @timuthy
- [ ] Support zone-pinning for single-zone HA control planes (via GRM mutating webhook)
The bullet points below highlight the API contract for introducing HA control planes via:

```yaml
controlPlane:
  highAvailability:
    failureTolerance:
      type: <node|zone>
```
- `non-HA` shoots can be scheduled on `non-HA` or `HA` (multi-zone) seeds.
- `single-zone` shoots can be scheduled on `non-HA` or `HA` (multi-zone) seeds.
- `multi-zone` shoots can be scheduled ONLY on `HA` (multi-zone) seeds.
- `non-HA` shoots can be upgraded to `single-zone` on `non-HA` or `HA` seeds. **
- `non-HA` shoots can be upgraded to `multi-zone` only on `HA` seeds. **
- `single-zone` shoots shall not be allowed to upgrade to `multi-zone` shoots and shall be stopped by admission plugins.

** This can lead to a short disruption/downtime when the etcd StatefulSet is rolled.
Legend:
- `non-HA` shoot: any shoot which has no `failureTolerance` defined.
- `single-zone` shoot: any shoot which has `failureTolerance` defined with type `node`.
- `multi-zone` shoot: any shoot which has `failureTolerance` defined with type `zone`.
- `non-HA` seed: any seed whose worker pools for etcd/cp run only in a single availability zone.
- `HA` seed: any seed whose worker pools for etcd/cp are defined across 3 availability zones and which has the label `seed.gardener.cloud/multi-zonal: "true"`.
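The scheduling part of this contract reduces to a single predicate. The following is a minimal sketch under the assumption that a seed's HA capability is captured by a boolean (function and parameter names are made up for illustration):

```go
package main

import "fmt"

// canSchedule sketches the scheduling contract listed above: a multi-zone
// shoot requires an HA (multi-zone) seed, while non-HA and single-zone
// shoots fit on any seed. tolerance is "" (non-HA), "node" (single-zone),
// or "zone" (multi-zone). Illustrative only, not Gardener's scheduler code.
func canSchedule(tolerance string, seedMultiZonal bool) bool {
	if tolerance == "zone" {
		// multi-zone shoots can be scheduled ONLY on HA (multi-zone) seeds.
		return seedMultiZonal
	}
	// non-HA and single-zone shoots can go to non-HA or HA seeds.
	return true
}

func main() {
	fmt.Println(canSchedule("zone", false))
	fmt.Println(canSchedule("node", false))
}
```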
Regarding "Enhance Pod eviction in case of zone outage (delete Pods in `Terminating` state)": for Deployments, the kube-controller-manager behaviour is to create new Pods right away while the old Pods are `Terminating`.
Example:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      terminationGracePeriodSeconds: 60
      containers:
      - name: nginx
        image: centos
        command: ["/bin/sh"]
        args: ["-c", "sleep 3600"]
        ports:
        - containerPort: 80
```
Above we have a Deployment. Its container does not handle SIGTERM, so it will hang in `Terminating` until it is force-killed after `terminationGracePeriodSeconds`.
```
$ k get po
NAME                                READY   STATUS    RESTARTS   AGE
nginx-deployment-746759f465-z95lj   1/1     Running   0          11m

$ k delete po nginx-deployment-746759f465-z95lj
pod "nginx-deployment-746759f465-z95lj" deleted

$ k get po
NAME                                READY   STATUS              RESTARTS   AGE
nginx-deployment-746759f465-z95lj   1/1     Terminating         0          11m
nginx-deployment-746759f465-ztz65   0/1     ContainerCreating   0          2s

$ k get po
NAME                                READY   STATUS        RESTARTS   AGE
nginx-deployment-746759f465-z95lj   1/1     Terminating   0          12m
nginx-deployment-746759f465-ztz65   1/1     Running       0          54s
```
Above you can see that when the old replica is deleted, the new one is created right away.
I suspect that in the experiments of @unmarshall (ref https://github.com/gardener/gardener/pull/6287#discussion_r942324110) there was a webhook preventing creation of new Pods (for some unknown reason) or kube-controller-manager was down (for some unknown reason). These are the 2 potential things that could explain https://github.com/gardener/gardener/pull/6287#discussion_r942324110.
Anyway, I will try to simulate a zone outage and check why KCM does not create the new Pods when the old ones are terminating.
We had a sync with @unmarshall and we are able to confirm that in a simulation of zone outage (simulated via network acl that denies all ingress and egress traffic for a zone) the recovery for a (multi-zone) control plane worked well as outlined in https://github.com/gardener/gardener/issues/6529#issuecomment-1253517049:
- For Deployments kube-controller-manager creates new replicas right away when the old replicas are terminating. The new replicas start successfully on a healthy zone.
- I think during @unmarshall's simulations kube-controller-manager was down for some reason. I also revised the webhooks we deploy and whether we could have a deadlock situation that could block new Pod creation but I didn't see anything abnormal.
PS: We also found that the existing garbage-collector (shoot-care-controller of gardenlet) already deletes Terminating pods in the Shoot's control plane after 5min.
https://github.com/gardener/gardener/blob/24b667c39f4ff1c6e733ce98816e4f6127a2476f/pkg/operation/care/garbage_collection.go#L85-L93
But this is not a recovery mechanism and does not by itself lead to recovery. For Deployments, kube-controller-manager already creates the new replicas. For StatefulSets, even when the old `Terminating` replicas are forcefully deleted, this does not lead to a recovery because the new StatefulSet Pods fail to be scheduled - they have scheduling requirements that cannot be satisfied during the zone outage (the etcd Pod has to run in the outage zone, or the loki/prometheus Pods have to run in the outage zone because their volume is already provisioned in this zone).
TL;DR: We will resolve the corresponding item as completed as nothing has to be done. Let us know if you have additional comments on this topic. We have to update GEP-20 with the new learnings.
Great to hear! Thank you!
I added another item Support control-plane migration for HA shoots since this doesn't seem to work out of the box. We should create a separate issue once we have more certainty and details and find a proper way to support this use-case.
cc @plkokanov @vlerenc
Should we (for now) add validation that forbids migration for HA shoots?
/assign @rfranzke
for the tasks related to
Configure high-availability settings for {seed system, shoot control plane, shoot system} components
(except for the Remainders/special cases)
All related PRs in gardener/gardener have been merged.
/unassign
/assign @plkokanov @ishan16696 for tasks related to
Support control-plane migration for HA shoots
Please see the approaches possible to achieve CPM in multi-node etcd: https://github.com/gardener/etcd-druid/issues/479#issuecomment-1365793557
All tasks have been completed. /close
@rfranzke: Closing this issue.
In response to this:

> All tasks have been completed. /close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.