DaemonSet tries to schedule pods to control plane (master) node even though marked NoSchedule.
When a DaemonSet is created, it creates a Pod for each fake node in the virtual cluster, each pinned to its node with an affinity rule such as:
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchFields:
        - key: metadata.name
          operator: In
          values:
          - ip-10-0-1-192.eu-central-1.compute.internal
It appears to do this for all nodes, even when the fake node corresponds to a control plane (master) node which has a NoSchedule taint.
From virtual cluster:
$ k get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
ip-10-0-1-192.eu-central-1.compute.internal Ready <none> 14m v1.19.5+k3s2 10.100.251.37 <none> Fake Kubernetes Image 4.19.76-fakelinux docker://19.3.12
ip-10-0-1-33.eu-central-1.compute.internal Ready <none> 11m v1.19.5+k3s2 10.97.142.13 <none> Fake Kubernetes Image 4.19.76-fakelinux docker://19.3.12
ip-10-0-1-100.eu-central-1.compute.internal Ready <none> 8m47s v1.19.5+k3s2 10.108.218.223 <none> Fake Kubernetes Image 4.19.76-fakelinux docker://19.3.12
ip-10-0-1-11.eu-central-1.compute.internal Ready <none> 8m47s v1.19.5+k3s2 10.111.189.160 <none> Fake Kubernetes Image 4.19.76-fakelinux docker://19.3.12
From underlying cluster:
$ k get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
ip-10-0-1-100.eu-central-1.compute.internal Ready <none> 55d v1.20.6+vmware.1 10.0.1.100 <none> Amazon Linux 2 4.14.231-173.361.amzn2.x86_64 containerd://1.4.3
ip-10-0-1-11.eu-central-1.compute.internal Ready <none> 48d v1.20.6+vmware.1 10.0.1.11 <none> Amazon Linux 2 4.14.231-173.361.amzn2.x86_64 containerd://1.4.3
ip-10-0-1-192.eu-central-1.compute.internal Ready control-plane,master 55d v1.20.6+vmware.1 10.0.1.192 <none> Amazon Linux 2 4.14.231-173.361.amzn2.x86_64 containerd://1.4.3
ip-10-0-1-33.eu-central-1.compute.internal Ready <none> 55d v1.20.6+vmware.1 10.0.1.33 <none> Amazon Linux 2 4.14.231-173.361.amzn2.x86_64 containerd://1.4.3
Where the control plane (master) node has:
spec:
  taints:
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
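For reference, a pod can only be scheduled onto that node if it carries a matching toleration along these lines (a generic example, not something vcluster generates); the event below shows the DaemonSet pod has no such toleration:
tolerations:
- key: node-role.kubernetes.io/master
  operator: Exists
  effect: NoSchedule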
The result is that the Pod the DaemonSet assigns to the control plane (master) node fails to run.
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2021-09-13T05:49:47Z"
    message: '0/4 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/master:}, that the pod didn''t tolerate, 3 node(s) didn''t match Pod''s node affinity.'
The fake nodes should possibly replicate the NoSchedule taints, and the syncer should take them into consideration when creating a Pod for a DaemonSet.
This one will actually be tricky for DaemonSet because what happens if nodes are added/removed, or the NoSchedule taint removed/added? Is the syncer going to be smart enough to know it needs to add/remove pods, or does it only work out what pods to create initially?
Was hoping I could say:
--node-selector=!node-role.kubernetes.io/control-plane,!node-role.kubernetes.io/master
but apparently the more general Kubernetes selector format isn't allowed.
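By "the more general Kubernetes selector format" I mean the set-based label selector syntax that kubectl itself accepts, e.g. something along these lines (where the ! prefix means the label must not exist):
kubectl get nodes -l '!node-role.kubernetes.io/control-plane,!node-role.kubernetes.io/master'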
I see a logged message of:
syncer F0914 01:30:57.619843 1 leaderelection.go:73] register pods controller: match expressions in the node selector are not supported
which I am presuming is flagging that as not possible. :-(
This is proving problematic and a bit of a blocker on something we are doing. :-(
The problem is that tools like kapp and kapp-controller from the Carvel tools (and possibly other deployment tools as well) will monitor the status of everything created from a deployment, waiting for it to complete. Because this issue results in a pod which can't be run, since it gets assigned to a control plane (master) node with a NoSchedule taint, the deployment tool will eventually time out with an error because it never sees the deployment complete.
@GrahamDumpleton thanks for creating this issue! This might be solved if you use --fake-nodes=false as a flag when starting up the vcluster, as it will then sync the labels and roles accordingly, which should prevent the DaemonSet from creating a pod on the master node. Please be aware that you will also need to set rbac.clusterRole.create=true, or otherwise vcluster won't be able to access the real nodes.
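For example, a rough sketch of the chart values (assuming you deploy via the helm chart and that the flag can be passed through syncer.extraArgs; adjust to however you actually deploy vcluster):
rbac:
  clusterRole:
    create: true
syncer:
  extraArgs:
  - --fake-nodes=false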
This one will actually be tricky for DaemonSet because what happens if nodes are added/removed, or the NoSchedule taint removed/added? Is the syncer going to be smart enough to know it needs to add/remove pods, or does it only work out what pods to create initially?
I'm not familiar with the syncer internals, but for me the expected behavior would be as close to "vanilla" k8s as possible: taints are only considered when scheduling a pod; if it's already running when a taint is removed/added, that would be ignored by the scheduler AFAIK. Same goes for added/removed nodes.
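To illustrate (generic kubectl commands, with <node-name> as a placeholder): adding or removing a NoSchedule taint by hand does not touch pods already running on the node; only a NoExecute taint would evict them:
kubectl taint nodes <node-name> node-role.kubernetes.io/master=:NoSchedule   # existing pods keep running
kubectl taint nodes <node-name> node-role.kubernetes.io/master:NoSchedule-   # removing it doesn't reschedule anything either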
The taints should be synced to the virtual nodes when a mode different than "Fake Nodes" is used (Docs), and then the DaemonSet behavior would match the behavior of vanilla k8s. Also, if the "Fake Nodes" mode is used, the control plane nodes are unlikely to be synced to the vcluster (and thus picked up by the DaemonSet controller) because vcluster workload pods are unlikely to be scheduled there due to the taints. I think this problem was happening with vcluster before we removed the tolerations from the vcluster's CoreDNS pod.
Closing as "explained".
P.S.: as for this:
Was hoping I could say:
--node-selector=!node-role.kubernetes.io/control-plane,!node-role.kubernetes.io/master
but apparently the more general Kubernetes selector format isn't allowed.
I will look into this and create a separate issue. If you happen to know where this syntax is documented, please comment with a link; that would be very helpful. :pray: