Synced nodes should omit taints that have an enforced toleration

Open martinnirtl opened this issue 3 years ago • 6 comments

Is your feature request related to a problem?

Yes. Tolerations are enforced at the pod level, which works nicely for Deployment and StatefulSet kinds. However, when every synced node carries a custom taint, a DaemonSet ends up with 0 running pods: the DaemonSet controller evaluates the DaemonSet without the toleration on it (the toleration is only added when the pod is synced). As a consequence, there are no nodes that the workload tolerates.
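
The failure mode can be illustrated with a small model. This is a simplified sketch, not the DaemonSet controller's actual code: it only models the standard rule that a pod is kept off a node carrying a NoSchedule or NoExecute taint that none of its tolerations match. The names `tolerates` and `eligible_nodes` and the sample taint `dedicated=vcluster:NoSchedule` are hypothetical.

```python
from typing import Dict, List

Taint = Dict[str, str]        # {"key": ..., "value": ..., "effect": ...}
Toleration = Dict[str, str]   # {"key": ..., "operator": ..., "value": ..., "effect": ...}


def tolerates(toleration: Toleration, taint: Taint) -> bool:
    """Simplified version of Kubernetes toleration matching."""
    # An empty effect on the toleration matches every taint effect.
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    if toleration.get("operator", "Equal") == "Exists":
        # With Exists, an empty key tolerates every taint key.
        return not toleration.get("key") or toleration["key"] == taint["key"]
    # Equal (the default operator): key and value must both match.
    return (toleration.get("key") == taint["key"]
            and toleration.get("value", "") == taint.get("value", ""))


def eligible_nodes(nodes: Dict[str, List[Taint]],
                   tolerations: List[Toleration]) -> List[str]:
    """Return the nodes whose blocking taints are all tolerated."""
    blocking = ("NoSchedule", "NoExecute")
    return [
        name for name, taints in nodes.items()
        if all(any(tolerates(tol, taint) for tol in tolerations)
               for taint in taints if taint["effect"] in blocking)
    ]


# Hypothetical custom taint that is synced onto every virtual node.
custom = {"key": "dedicated", "value": "vcluster", "effect": "NoSchedule"}
virtual_nodes = {"node-a": [custom], "node-b": [custom]}

# DaemonSet template as seen inside the vcluster: no toleration yet.
print(eligible_nodes(virtual_nodes, []))
# Same template with the toleration applied manually.
print(eligible_nodes(virtual_nodes, [dict(custom, operator="Equal")]))
```

With no toleration on the template, the first call returns an empty list, which corresponds to the 0 pods described above; with the toleration applied manually, both nodes are eligible.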

Which solution do you suggest?

Edited by @matskiv after discussion in the comments below.

The node syncer should not sync taints from the host node for which the vcluster has enforced tolerations defined. For example, --enforce-toleration=key1=value1:NoSchedule would mean that the key1=value1:NoSchedule taint is not synced to the virtual node in the vcluster.
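
A minimal sketch of that filtering follows, written in Python for brevity (vcluster itself is written in Go, and this is not its implementation). It assumes the flag value always has the `key=value:Effect` form shown above; `parse_enforced_toleration` and `taints_to_sync` are hypothetical names.

```python
from typing import Dict, List

Taint = Dict[str, str]  # {"key": ..., "value": ..., "effect": ...}


def parse_enforced_toleration(spec: str) -> Taint:
    """Parse 'key=value:Effect', the only form assumed in this sketch."""
    key_value, sep, effect = spec.rpartition(":")
    if not sep:
        raise ValueError(f"expected key=value:Effect, got {spec!r}")
    key, _, value = key_value.partition("=")
    return {"key": key, "value": value, "effect": effect}


def taints_to_sync(host_taints: List[Taint],
                   enforced_specs: List[str]) -> List[Taint]:
    """Drop host taints that exactly match an enforced toleration."""
    enforced = [parse_enforced_toleration(spec) for spec in enforced_specs]
    return [
        taint for taint in host_taints
        if {"key": taint["key"],
            "value": taint.get("value", ""),
            "effect": taint["effect"]} not in enforced
    ]


host_taints = [
    {"key": "key1", "value": "value1", "effect": "NoSchedule"},
    {"key": "other", "value": "x", "effect": "NoExecute"},
]
# Only the taint without an enforced toleration reaches the virtual node.
print(taints_to_sync(host_taints, ["key1=value1:NoSchedule"]))
```

Matching here is exact on key, value, and effect; a real implementation would likely need to decide how to treat tolerations that omit the value or the effect.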

Original: ~~Enforce tolerations via a mutating webhook instead of the syncer's enforce-toleration flag, or potentially allow opting in to enforcing tolerations via a mutating webhook. Adding a mutating webhook out of the box would allow greater flexibility (support for DaemonSets) as well as transparency (the toleration gets added at the workload level instead of the pod level).~~

Which alternative solutions exist?

Manually apply the toleration for DaemonSets.

Additional context

The goal is to have vcluster operate on dedicated, real/synced nodes.

martinnirtl Mar 23 '22 20:03

@martinnirtl thanks for creating this issue! Couldn't you also set the tolerations on the DaemonSet itself? Even though they don't get synced, that would work, as they would be enforced through the flag.

In general, we are planning to add a new mode to vcluster where the scheduler runs inside vcluster as well. This would allow you to set taints and tolerations on nodes and workloads inside the vcluster without affecting the nodes of the host cluster, which should also solve your problem.

FabianKramm Mar 24 '22 09:03

Well, this feature request came to my mind when I was trying to create a vcluster with dedicated, real nodes. I wanted to keep any "external" workload from being scheduled on the vcluster's nodes, to get as close as possible to a real cluster.

I agree that one could simply put the toleration on the DaemonSet, but I would prefer to get this functionality from something like a toleration controller (a mutating webhook that adds the vcluster tolerations to any workload type), which would only manage those tolerations that are there for "protecting" the vcluster nodes.

I was thinking that this feature could maybe be implemented as a plugin, and I would even be up for implementing it myself. So feel free to close this request, and I will get back to you once I am done with it :)) Maybe somebody will be looking for this as well in the future.

martinnirtl Mar 29 '22 18:03

As an alternative to a custom vcluster plugin, I suggest using a Kubernetes policy engine, like Kyverno or jsPolicy, to dynamically patch the pods managed by vcluster with a mutating webhook.

I had similar needs and this is what I ended up doing. It seems to work pretty well so far, including with daemonsets.

My use case was to have certain taints and tolerations that are unique per namespace, so that I can have dedicated node pools that belong to virtual clusters.

I'm using the following Kyverno policy:

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: vcluster-node-toleration
spec:
  rules:
    - name: patch-pod-with-node-toleration
      match:
        resources:
          kinds:
            - Pod
          selector:
            matchLabels:
              vcluster.loft.sh/managed-by: vcluster
      mutate:
        patchesJson6902: |-
          - path: "/spec/tolerations/-1"
            op: add
            value: {"key": "vcluster-node","value": "{{request.object.metadata.namespace}}", "effect": "NoSchedule"}

Note that I'm using the namespace as the taint value to identify the vcluster.
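
For illustration, the effect of the patch on a single pod can be modelled as below. This is a sketch only: Kyverno performs the real mutation at admission time, and `add_vcluster_toleration` is a hypothetical helper. It also assumes that a matching `vcluster-node=<namespace>:NoSchedule` taint is already present on the dedicated nodes. As a simplification, the sketch creates the tolerations list when the pod has none.

```python
import copy


def add_vcluster_toleration(pod: dict) -> dict:
    """Return a copy of the pod with the per-namespace toleration appended."""
    patched = copy.deepcopy(pod)
    namespace = patched["metadata"]["namespace"]
    # Simplification: create the list if the pod has no tolerations yet.
    tolerations = patched["spec"].setdefault("tolerations", [])
    tolerations.append({
        "key": "vcluster-node",
        "value": namespace,   # the namespace identifies the vcluster
        "effect": "NoSchedule",
    })
    return patched


# Hypothetical pod synced by a vcluster running in namespace "team-a".
pod = {
    "metadata": {
        "namespace": "team-a",
        "labels": {"vcluster.loft.sh/managed-by": "vcluster"},
    },
    "spec": {"tolerations": []},
}
print(add_vcluster_toleration(pod)["spec"]["tolerations"])
```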

alexandrem Apr 05 '22 01:04

There could be an alternative solution: when a particular toleration is enforced by vcluster, the nodes synced by vcluster should have their taints modified based on the enforced toleration. So --enforce-toleration=key1=value1:NoSchedule would mean that the key1=value1:NoSchedule taint would not be synced to the virtual node in vcluster. @FabianKramm wdyt?

matskiv Aug 02 '22 09:08

Hey @matskiv! From my point of view, this would be the most elegant solution for the use case, and it would effectively complete the --enforce-toleration feature. It would also eliminate the need for mutations from within the vcluster.

martinnirtl Aug 02 '22 11:08

Hi @FabianKramm and @matskiv,

I would like to work on this issue. Could you please assign it to me?

Thanks

neogopher Aug 03 '22 17:08