
DaemonSet tries to schedule pods to control plane (master) node even though marked NoSchedule.

GrahamDumpleton opened this issue on Sep 13 '21 · 5 comments

When a DaemonSet is created, it creates a Pod for each fake node in the virtual cluster, each with a node affinity rule of the form:

   affinity:
     nodeAffinity:
       requiredDuringSchedulingIgnoredDuringExecution:
         nodeSelectorTerms:                                             
         - matchFields:
           - key: metadata.name
             operator: In
             values:
             - ip-10-0-1-192.eu-central-1.compute.internal

It appears to do this for all nodes, even when the fake node corresponds to a control plane (master) node which has a NoSchedule taint.

From virtual cluster:

$ k get nodes -o wide
NAME                                          STATUS   ROLES    AGE     VERSION        INTERNAL-IP      EXTERNAL-IP   OS-IMAGE                KERNEL-VERSION      CONTAINER-RUNTIME
ip-10-0-1-192.eu-central-1.compute.internal   Ready    <none>   14m     v1.19.5+k3s2   10.100.251.37    <none>        Fake Kubernetes Image   4.19.76-fakelinux   docker://19.3.12
ip-10-0-1-33.eu-central-1.compute.internal    Ready    <none>   11m     v1.19.5+k3s2   10.97.142.13     <none>        Fake Kubernetes Image   4.19.76-fakelinux   docker://19.3.12
ip-10-0-1-100.eu-central-1.compute.internal   Ready    <none>   8m47s   v1.19.5+k3s2   10.108.218.223   <none>        Fake Kubernetes Image   4.19.76-fakelinux   docker://19.3.12
ip-10-0-1-11.eu-central-1.compute.internal    Ready    <none>   8m47s   v1.19.5+k3s2   10.111.189.160   <none>        Fake Kubernetes Image   4.19.76-fakelinux   docker://19.3.12

From underlying cluster:

$ k get nodes -o wide
NAME                                          STATUS   ROLES                  AGE   VERSION            INTERNAL-IP   EXTERNAL-IP   OS-IMAGE         KERNEL-VERSION                  CONTAINER-RUNTIME
ip-10-0-1-100.eu-central-1.compute.internal   Ready    <none>                 55d   v1.20.6+vmware.1   10.0.1.100    <none>        Amazon Linux 2   4.14.231-173.361.amzn2.x86_64   containerd://1.4.3
ip-10-0-1-11.eu-central-1.compute.internal    Ready    <none>                 48d   v1.20.6+vmware.1   10.0.1.11     <none>        Amazon Linux 2   4.14.231-173.361.amzn2.x86_64   containerd://1.4.3
ip-10-0-1-192.eu-central-1.compute.internal   Ready    control-plane,master   55d   v1.20.6+vmware.1   10.0.1.192    <none>        Amazon Linux 2   4.14.231-173.361.amzn2.x86_64   containerd://1.4.3
ip-10-0-1-33.eu-central-1.compute.internal    Ready    <none>                 55d   v1.20.6+vmware.1   10.0.1.33     <none>        Amazon Linux 2   4.14.231-173.361.amzn2.x86_64   containerd://1.4.3

Where the control plane (master) node has:

spec:
  taints:
  - effect: NoSchedule
    key: node-role.kubernetes.io/master

The result is that the DaemonSet Pod assigned to the control plane (master) node can never run.

status:
   conditions: 
   - lastProbeTime: null
     lastTransitionTime: "2021-09-13T05:49:47Z"                         
     message: '0/4 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/master:}, that the pod didn''t tolerate, 3 node(s) didn''t match Pod''s node affinity.'
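
For reference, the only way such a Pod could run on that node is if it carried a matching toleration, which the Pod created for the DaemonSet does not. A minimal sketch of what that toleration would look like (standard Kubernetes, shown here only for illustration):

   tolerations:
   - effect: NoSchedule
     key: node-role.kubernetes.io/master
     operator: Exists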

The fake nodes should possibly replicate the NoSchedule taints, and those taints should be taken into consideration when creating Pods for a DaemonSet.

GrahamDumpleton · Sep 13 '21, 06:09

This one will actually be tricky for DaemonSet because what happens if nodes are added/removed, or the NoSchedule taint removed/added. Is the syncer going to be smart enough to know it needs to add/remove pods, or does it only work out what pods to create initially?

GrahamDumpleton · Sep 13 '21, 20:09

Was hoping I could say:

--node-selector=!node-role.kubernetes.io/control-plane,!node-role.kubernetes.io/master

but apparently the more general Kubernetes selector format isn't allowed.

I see a logged message of:

syncer F0914 01:30:57.619843       1 leaderelection.go:73] register pods controller: match expressions in the node selector are not supported

which I am presuming is flagging that as not possible. :-(
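
A possible workaround sketch, assuming the syncer's --node-selector only accepts simple equality-based label selectors (which is what the "match expressions ... are not supported" error suggests): label the worker nodes in the host cluster and select on that label. The node-type=worker label here is hypothetical.

$ kubectl label nodes ip-10-0-1-33.eu-central-1.compute.internal node-type=worker
$ kubectl label nodes ip-10-0-1-100.eu-central-1.compute.internal node-type=worker
$ kubectl label nodes ip-10-0-1-11.eu-central-1.compute.internal node-type=worker

and then start the syncer with:

--node-selector=node-type=worker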

GrahamDumpleton · Sep 14 '21, 01:09

This is proving problematic and a bit of a blocker on something we are doing. :-(

The problem is that tools like kapp and kapp-controller from the Carvel tools (and possibly other deployment tools as well) monitor the status of everything created by a deployment, waiting for it to complete. Because this issue results in a Pod that can never run (it gets assigned to the control plane (master) node with the NoSchedule taint), the deployment tool eventually times out with an error, since it never sees the deployment complete.

GrahamDumpleton · Sep 14 '21, 01:09

@GrahamDumpleton thanks for creating this issue! This might be solved by using --fake-nodes=false as a flag to start up the vcluster, as it will then sync the labels and roles accordingly, which should prevent the DaemonSet from creating a pod on the master node. Please be aware that you will also need to set rbac.clusterRole.create=true, or otherwise vcluster won't be able to access the real nodes.
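
A minimal sketch of how those two settings could be passed together, assuming the vcluster Helm chart exposes rbac.clusterRole.create and a syncer.extraArgs list (check the chart values for your vcluster version; the file and release names below are hypothetical):

# values.yaml
rbac:
  clusterRole:
    create: true
syncer:
  extraArgs:
    - --fake-nodes=false

$ helm upgrade --install my-vcluster vcluster \
    --repo https://charts.loft.sh \
    --namespace my-vcluster --create-namespace \
    -f values.yaml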

FabianKramm · Sep 14 '21, 06:09

"This one will actually be tricky for DaemonSet because what happens if nodes are added/removed, or the NoSchedule taint removed/added. Is the syncer going to be smart enough to know it needs to add/remove pods, or does it only work out what pods to create initially?"

I'm not familiar with the syncer internals, but for me the expected behavior would be as close to "vanilla" k8s as possible: taints are only considered when scheduling a pod; if a pod is already running when a taint is removed/added, that is ignored by the scheduler AFAIK. The same goes for added/removed nodes.
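
For illustration, this is the vanilla behavior being described: a NoSchedule taint only affects new scheduling decisions, whereas NoExecute also evicts already-running Pods that don't tolerate it (the node name and taint keys below are placeholders):

$ kubectl taint nodes <node-name> example-noschedule=:NoSchedule   # existing Pods keep running
$ kubectl taint nodes <node-name> example-noexecute=:NoExecute     # Pods without a matching toleration are evicted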

mld · Jun 29 '22, 07:06

The taints should be synced to the virtual nodes when a mode other than "Fake nodes" is used (see the docs), and then the DaemonSet behavior matches that of vanilla k8s. Also, if the "Fake nodes" mode is used, the control plane nodes are unlikely to be synced to the vcluster (and thus picked up by the DaemonSet controller), because vcluster workload pods are unlikely to be scheduled there due to the taints. I think this problem was happening with vcluster before we removed the tolerations from the vcluster's CoreDNS pod.
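
One way to confirm this, assuming a non-"Fake nodes" mode, is to compare the taints on the node object inside the vcluster with the host cluster's node, e.g.:

$ kubectl get node ip-10-0-1-192.eu-central-1.compute.internal -o jsonpath='{.spec.taints}'

Run against both the virtual and the host cluster; the synced virtual node should report the same NoSchedule taint.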

Closing as "explained".

P.S.: as for this:

"Was hoping could say: --node-selector=!node-role.kubernetes.io/control-plane,!node-role.kubernetes.io/master but apparently the more general Kubernetes selector format isn't allowed."

I will look into this and create a separate issue. If you happen to know where this syntax is documented, please comment with a link; that would be very helpful. :pray:

matskiv · Nov 07 '22, 10:11