
drain

kfox1111 opened this issue 4 years ago · 4 comments

I've been playing around with vcluster a little bit. Very cool tool. I think there may be an unforeseen difficulty with it from the standpoint of managing the underlying cluster. We typically drain each node, one at a time, to upgrade the cluster. But when I tested that functionality while a vcluster was running, the drain failed with:

    cannot delete Pods not managed by ReplicationController, ReplicaSet, Job, DaemonSet or StatefulSet (use --force to override): default/coredns-85cb69466-cljmm-x-kube-system-x-test

It looks like the owner reference is copied, but the underlying k8s seems to ignore it (perhaps because the referenced owner doesn't exist there). Forcing a drain is not a particularly safe operation.
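For illustration, a rough sketch of the ownerReferences a synced pod carried at the time (field values hypothetical; as discussed later in the thread, the owner kind depends on the vcluster flavor):

    metadata:
      ownerReferences:
        - apiVersion: apps/v1
          kind: StatefulSet              # the owner of the vcluster pod itself
          name: test                     # hypothetical vcluster name
          uid: <uid-of-the-statefulset>  # placeholder
          # no controller: true here, so kubectl drain does not treat
          # the pod as managed and refuses to evict it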

I don't know if there is an annotation saying a pod is actually managed, allowing it to be deleted. If such a feature doesn't exist, it probably should. Maybe a feature request to k8s is in order. Maybe there are other ways of fixing it too.

Along with this, a complementary feature would be pod disruption budget syncing, if that isn't already supported. That way the over cluster could safely inform the undercluster about what is safe to drain.
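For illustration, this is the kind of PodDisruptionBudget a workload admin might create inside the vcluster; syncing it to the host cluster would let drains there respect it (a minimal sketch, with hypothetical names and labels):

    apiVersion: policy/v1
    kind: PodDisruptionBudget
    metadata:
      name: web-pdb          # hypothetical
    spec:
      minAvailable: 2        # never evict below two running replicas
      selector:
        matchLabels:
          app: web           # hypothetical workload label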

kfox1111 · Jan 06 '22

https://github.com/kubernetes/kubernetes/issues/57049 is related.

kfox1111 · Jan 06 '22

@kfox1111 thanks for creating this issue! Yes, vcluster is managing those pods. For newer versions of vcluster, v0.6.0 and onwards, you could pause the vcluster with vcluster pause ..., then drain the nodes, upgrade them, and then resume the vcluster. Other than that, I'm not sure there is much we can do.

FabianKramm · Jan 06 '22

Pausing wouldn't help, as it would have mostly the same effect: making the upper cluster's workloads unavailable. It also breaks the usage contract.

Let me go through a concrete example that might help describe the use case/problem. You might know a lot of this, but I'll go through it anyway in case there are other readers.

Some organizations are starting to provide soft multi-tenant clusters so that groups within the organization can start using Kubernetes as users, without having to learn all the complexities of managing a cluster properly. This largely relies on the separation of the cluster-admin role and the admin (namespace- and workload-related) role. The cluster-admin is a different person (or group of people) from the workload administrators.

The cluster admin's goal is to provide production-class clusters. This includes applying software updates (especially security-related ones) in a timely manner. As there are often significantly more workload admins than cluster admins, the cluster admins can't afford to talk to each workload admin and ensure things are "safe to upgrade" before performing cluster upgrades. So the API contract Kubernetes provides between the cluster admin and the workload admin is very important. Things like being able to properly kubectl drain a node, along with PodDisruptionBudgets, are quite important. With increasing levels of automation, cluster admins will delegate upgrades to a controller that performs them automatically. The contract becomes even more important in that environment.

vcluster's claim to fame over other solutions such as cluster-api-provider-nested is that it doesn't require assistance from the cluster-admin. This means there is a pretty hard requirement that the cluster-admin and the vcluster admin are different people (or teams) by design. This is one place where pausing doesn't work: it requires the cluster-admin to do things for the workload admin, or to contact them, breaking automation and not scaling. If this contract is broken, cluster admins may take steps to actively block vcluster, which is not in vcluster's best interest.

The workload admins are also adopting increasing levels of automation (a good thing). Let's take a specific example: the elastic operator (ECK). It deploys and manages a highly available elasticsearch cluster on top of Kubernetes on behalf of the user. This is a really good thing and really showcases what Kubernetes can accomplish. The workload admin loads a description of the elasticsearch cluster, and the operator ensures its proper operation. The workload owner may not even have a firm understanding of how to operate the workload safely; that's delegated to the operator. So even if the cluster-admin could contact the workload admin, they wouldn't know what to do. The operator does clever things such as watching the state of the elasticsearch cluster. The cluster can tolerate one pod being down while remaining fully functional, but having a second pod down would break it. So the operator watches the internal state of the elasticsearch cluster and adjusts Kubernetes pod disruption budgets accordingly: drains are allowed while the elasticsearch cluster is fully synchronized/healthy, and blocked until the cluster becomes safe again. Simply having all pods up isn't enough to guarantee synchronization, and therefore isn't enough to make deleting a pod safe.
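For illustration, a hedged sketch of the kind of PDB such an operator maintains, tightened or loosened as the cluster's health changes (not the operator's exact output; the name and label are assumptions):

    apiVersion: policy/v1
    kind: PodDisruptionBudget
    metadata:
      name: elasticsearch-pdb    # hypothetical
    spec:
      maxUnavailable: 1          # permitted only while the ES cluster is healthy
      selector:
        matchLabels:
          elasticsearch.k8s.elastic.co/cluster-name: prod   # assumed label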

This all establishes a proper handshake between the automation doing the rolling upgrades and the automation managing the workload. Automatic node upgrades happen, so the cluster-admin's life is much easier. The workload admin doesn't need to know the details of how it works, and the whole time a highly available elasticsearch cluster performs its work without downtime, even though nodes are pulled out from under it in production. Life's great, right?

So, here is one of the rubs, and where vcluster again shines: these workload-related delegations, operators, typically need CRDs to be loaded into the cluster, and that's a privileged operation, so it requires cluster-admin. That's time-consuming for the cluster-admin and risky since CRDs are global, so there is pushback on adding too many. vcluster provides a good solution to the issue: allow the workload administrator to have their own control plane, so all operators are now available to them. They can load whatever operators they want within the nested cluster. Now there is a pure separation between the cluster admin and the workload admin.

We're almost at a great place now, but this contract is currently broken in vcluster. A cluster admin maintaining a production undercluster will be doing what they need to do, and the broken API contract will force them to either take significantly more time to do their work (costing a lot), break workloads, or take steps to block vcluster workloads. And users will blame the cluster-admins either for saying no or for "breaking their workload". This causes trouble for everyone involved.

So, this is why I think it's so important to find a solution to this issue. If it takes adding some functionality to Kubernetes, we need to start the process soon. It takes a while to get features into Kubernetes, so the sooner we start, the sooner we can solve it.

kfox1111 · Jan 06 '22

@kfox1111 Thanks for the detailed use case description; it's very useful.

I had a look into the kubectl drain implementation, and the bits (part A, part B) that prevent the eviction of pods created by the vcluster look for controller: true in the ownerReferences. So once the pods created by vcluster have controller: true in their ownerReferences, kubectl drain works; see the logs from my test:

    $ kubectl get pod nginx-deployment-66b6c48dd5-fszdf-x-test-x-vcluster -o jsonpath-as-json='{.metadata.ownerReferences}'
    [
        [
            {
                "apiVersion": "v1",
                "controller": true,
                "kind": "Service",
                "name": "vcluster",
                "uid": "59ac17b4-9645-4a80-b31f-ed8a801c8b2c"
            }
        ]
    ]
    $ kubectl drain minikube-m02
    node/minikube-m02 cordoned
    evicting pod vcluster/nginx-deployment-66b6c48dd5-fszdf-x-test-x-vcluster
    pod/nginx-deployment-66b6c48dd5-fszdf-x-test-x-vcluster evicted
    node/minikube-m02 drained

Vcluster already sets an owner reference on the pods, but it uses the owner of the vcluster pod as the owner of the spawned pods, and that owner is either a StatefulSet or a ReplicaSet/Deployment (depending on the vcluster flavor: k3s/k8s/k0s). Adding controller: true to an ownerReference with kind: StatefulSet doesn't work; such an ownerReference gets removed. @FabianKramm suggested that we use the "vcluster" Service as the owner, as that will simplify the syncer code (same for all flavors) and allow us to add controller: true on the resources that have this value in the vcluster.
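In YAML form, the ownerReference the syncer would set after this change looks roughly like the following (a sketch of the intended shape, with field values taken from the test output above, not final syncer output):

    metadata:
      ownerReferences:
        - apiVersion: v1
          kind: Service
          name: vcluster
          uid: 59ac17b4-9645-4a80-b31f-ed8a801c8b2c   # UID of the "vcluster" Service
          controller: true                            # lets kubectl drain treat the pod as managed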

So the plan is to switch to the "vcluster" Service as the owner, and to add controller: true to the synced resources that have the same value inside of the vcluster. An update of the "--set-owner" flag description will also be needed. @kfox1111, if you are interested in contributing a PR with these changes, we would be more than happy to provide assistance, preferably via our Slack instance; feel free to ping me there :). And if not, that's absolutely fine; we should get around to making this change very soon.

Once we make the change described above and node draining works, it becomes really important to honor the PodDisruptionBudgets that users create in their clusters. We have an existing issue to support that: #72. IMHO we should land the fix for this issue together with PodDisruptionBudget support, definitely in the same minor release of vcluster.

matskiv · Jan 07 '22

Did pdb syncing ever make it in?

kfox1111 · Sep 22 '22

@kfox1111 yes, you can enable it with this helm value:

sync:
  poddisruptionbudgets:
    enabled: true
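
If it helps, a value like this is typically passed when creating or upgrading the vcluster, e.g. with vcluster create my-vcluster -f values.yaml or an equivalent helm upgrade --values invocation (the vcluster name and file name here are hypothetical).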

matskiv · Sep 22 '22

Awesome. Thanks! :)

kfox1111 · Sep 22 '22