Gatekeeper Prevents New Master Node from Joining Cluster
What steps did you take and what happened: We had been using Gatekeeper successfully for a few weeks without issue. However, the other day we had to replace a K8s master node, and Gatekeeper kept the master node from joining the cluster. After running `kubectl delete validatingwebhookconfigurations.admissionregistration.k8s.io gatekeeper-validating-webhook-configuration`, the node was able to join.
We discovered we had not told Gatekeeper to ignore the kube-system namespace. So we updated our config, then intentionally replaced a master. Same issue: the master would not join. Worker nodes have no issues joining.
The kubelet logs on the master trying to join were not overly helpful. There was an error about not being able to reach the Gatekeeper endpoint, but there were other endpoint errors as well.
Install method was running `kubectl apply -f https://raw.githubusercontent.com/open-policy-agent/gatekeeper/master/deploy/gatekeeper.yaml`
What did you expect to happen: Gatekeeper to run without interfering with master nodes.
Environment:
- Gatekeeper version: v3.1.0-beta.9
- Kubernetes version: 1.17.9
This is somewhat related to a feature request we have for validating webhooks in kubernetes:
https://github.com/kubernetes/kubernetes/issues/92157
Was the Gatekeeper pod running at the time you upgraded the master node, or was it taken down as part of the upgrade?
What specific resource was it trying to create when this happened?
In the short term you can apply the ignore label to the kube-system namespace: https://github.com/open-policy-agent/gatekeeper#exempting-namespaces-from-the-gatekeeper-admission-webhook-using---exempt-namespace-flag
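For reference, exempting a namespace requires both pieces: the controller flag and the ignore label on the namespace. A sketch, assuming the standard Gatekeeper v3 label key (the label value is arbitrary):

```shell
# 1. Allow Gatekeeper to exempt kube-system by adding this argument to the
#    gatekeeper controller-manager deployment (or via your install method):
#      --exempt-namespace=kube-system
# 2. Label the namespace so the webhook skips requests in it:
kubectl label namespace kube-system admission.gatekeeper.sh/ignore=no-self-managing
```

Without step 1, the label is rejected or has no effect, since the controller only honors the label on namespaces it was told it may exempt.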
Was the Gatekeeper pod running at the time you upgraded the master node, or was it taken down as part of the upgrade?
It was running during the replacement of the master.
What specific resource was it trying to create when this happened?
It appears Rancher was trying to create/update a ServiceAccount and/or ClusterRoleBinding.
In the short term you can apply the ignore label to the kube-system namespace
I had actually done this, but I did not add the `--exempt-namespace` flag. I went back and added that and it still didn't work. Then I went through and exempted/labeled all the Rancher namespaces. Still no luck. I then upgraded to v3.1.0-rc.1, and still no luck there either.
Again, running the emergency removal command allowed the test Master I had made to join the cluster. There might be some other namespace Rancher uses, but I have them all ignored at the moment.
It appears Rancher was trying to create/update a ServiceAccount and/or ClusterRoleBinding.
Do you see any errors/events generated due to failure to create or update these resources? Do you have any constraints in the cluster at the time of this failure? If so, can you pls share?
One thing to note is that `ClusterRoleBinding` is not a namespaced resource, so it's possible that exempting namespaces has no effect. From the k8s docs:
The namespaceSelector decides whether to run the webhook on a request for a namespaced resource (or a Namespace object), based on whether the namespace's labels match the selector. If the object itself is a namespace, the matching is performed on object.metadata.labels. If the object is a cluster scoped resource other than a Namespace, namespaceSelector has no effect.
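One way to check what the webhook will still intercept (assuming the default webhook configuration name) is to inspect the scope of its rules, since `namespaceSelector` cannot exclude cluster-scoped resources:

```shell
# Print each webhook's name and the scope of its rules.
# A scope of "*" (or "Cluster") means cluster-scoped resources such as
# ClusterRoleBinding are matched regardless of any namespace exemptions.
kubectl get validatingwebhookconfiguration gatekeeper-validating-webhook-configuration \
  -o jsonpath='{range .webhooks[*]}{.name}{": scope="}{.rules[*].scope}{"\n"}{end}'
```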
+1 to the possibility of some cluster-scoped resource causing issues.
Do you know if the validating webhook was affirmatively rejecting requests, or is this rejection due to timeouts?
Did the ValidatingWebhookConfiguration have a timeout set? Was it set to fail open?
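Both settings can be read directly off the webhook configuration (default object name assumed); `failurePolicy: Ignore` means fail open, `Fail` means fail closed:

```shell
# Show timeout and failure policy for each webhook entry.
kubectl get validatingwebhookconfiguration gatekeeper-validating-webhook-configuration \
  -o jsonpath='{range .webhooks[*]}{.name}{": timeoutSeconds="}{.timeoutSeconds}{", failurePolicy="}{.failurePolicy}{"\n"}{end}'
```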
What is the solution to this? Is there a workaround that can be used for now?
Unfortunately we don't have enough information right now.
- Is this a general problem for all clusters, or just Rancher?
- What resource requests are leading to problems calling the webhook? GVK, Namespace, Name?
- Why is the webhook unable to be reached even when the webhook pods are running?
- Is the webhook configured to fail open and with a short enough timeout?
We had a similar issue on GKE, the webhook timeout prevented any node from joining the cluster. (This cluster was on preemptible instances so at some point all nodes may have been removed)
After deleting both webhooks (`gatekeeper-mutating-webhook-configuration` and `gatekeeper-validating-webhook-configuration`), the cluster recovered. The cluster is on 1.19.13-gke.1900 for both the control plane and nodes. This also caused GKE to trigger node pool auto-repair, but it was unable to self-repair.
Errors from kubelet:
`https://{gke-ip}/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/{node-id}?timeout=10s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)`
and
`Unable to register node "{node-id}" with API server: Post "https://{gke-ip}/api/v1/nodes": read tcp 10.128.0.37:60220->{gke-ip}:443: use of closed network connection`
This also causes `Unable to update cni config: no networks found in /etc/cni/net.d`.
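For reference, the emergency recovery described above (deleting both webhook configurations) corresponds to the following; note this disables all Gatekeeper admission enforcement until the configurations are reinstalled:

```shell
# Emergency escape hatch: remove both Gatekeeper admission webhooks so the
# API server stops calling the (unreachable) webhook service.
kubectl delete mutatingwebhookconfiguration gatekeeper-mutating-webhook-configuration
kubectl delete validatingwebhookconfiguration gatekeeper-validating-webhook-configuration
```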
This is interesting, that call lists a timeout of 10 seconds. Gatekeeper's validation webhook has a timeout of 2 seconds (though it was higher previously). What is your current validating webhook timeout? Are you running the mutation webhook?
We have v3.4.0; the mutating and validating webhooks had 30- and 3-second timeouts respectively.
I have experienced the same problem with master nodes in a BYO K8s cluster with Gatekeeper v3.7.0.
- I did not have any issues with adding/removing worker nodes; only master nodes fail to register.
Setup
- gatekeeper pods are up and running
- there are 3 masters running which were set up before gatekeeper was deployed
- a new master node is added and it fails to register
Logs
Master node Kubelet log shows a lot of errors like this:
Jan 18 21:24:07 master-node-name kubelet[24315]: E0118 21:24:07.559174 24315 kubelet.go:2212] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: kubenet does not have netConfig. This is most likely due to lack of PodCIDR
Jan 18 21:24:11 master-node-name kubelet[24315]: E0118 21:24:11.349627 24315 kubelet.go:2292] node "master-node-name" not found
Jan 18 21:24:11 master-node-name kubelet[24315]: E0118 21:24:11.449800 24315 kubelet.go:2292] node "master-node-name" not found
Jan 18 21:24:11 master-node-name kubelet[24315]: E0118 21:24:11.550034 24315 kubelet.go:2292] node "master-node-name" not found
Jan 18 21:24:11 master-node-name kubelet[24315]: E0118 21:24:11.650209 24315 kubelet.go:2292] node "master-node-name" not found
Jan 18 21:24:11 master-node-name kubelet[24315]: I0118 21:24:11.655041 24315 nodeinfomanager.go:403] Failed to publish CSINode: nodes "master-node-name" not found
Jan 18 21:24:11 master-node-name kubelet[24315]: I0118 21:24:11.670474 24315 nodeinfomanager.go:403] Failed to publish CSINode: nodes "master-node-name" not found
Jan 18 21:24:11 master-node-name kubelet[24315]: I0118 21:24:11.731168 24315 nodeinfomanager.go:403] Failed to publish CSINode: nodes "master-node-name" not found
Jan 18 21:24:11 master-node-name kubelet[24315]: E0118 21:24:11.750440 24315 kubelet.go:2292] node "master-node-name" not found
Jan 18 21:24:11 master-node-name kubelet[24315]: E0118 21:24:11.850677 24315 kubelet.go:2292] node "master-node-name" not found
Jan 18 21:24:11 master-node-name kubelet[24315]: E0118 21:24:11.950879 24315 kubelet.go:2292] node "master-node-name" not found
Jan 18 21:24:11 master-node-name kubelet[24315]: I0118 21:24:11.991486 24315 nodeinfomanager.go:403] Failed to publish CSINode: nodes "master-node-name" not found
Jan 18 21:24:11 master-node-name kubelet[24315]: E0118 21:24:11.991519 24315 csi_plugin.go:285] Failed to initialize CSINode: error updating CSINode annotation: timed out waiting for the condition; caused by: nodes "master-node-name" not found
Jan 18 21:24:11 master-node-name kubelet[24315]: F0118 21:24:11.991525 24315 csi_plugin.go:299] Failed to initialize CSINode after retrying: timed out waiting for the condition
Jan 18 21:24:12 master-node-name systemd[1]: kubelet.service: main process exited, code=exited, status=255/n/a
Gatekeeper helm config:
replicas: 3
auditInterval: 300
auditMatchKindOnly: true
constraintViolationsLimit: 20
auditFromCache: false
disableMutation: false
disableValidatingWebhook: false
validatingWebhookTimeoutSeconds: 30
validatingWebhookFailurePolicy: Fail
validatingWebhookCheckIgnoreFailurePolicy: Fail
enableDeleteOperations: false
enableExternalData: false
mutatingWebhookFailurePolicy: Fail
mutatingWebhookTimeoutSeconds: 30
auditChunkSize: 100
logLevel: INFO
logDenies: true
emitAdmissionEvents: true
emitAuditEvents: true
resourceQuota: true
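Given those values (both webhooks fail closed with 30-second timeouts), a lower-risk configuration while debugging would be to fail open with a short timeout. A sketch using the value names from the config above; the release name `gatekeeper`, chart `gatekeeper/gatekeeper`, and namespace `gatekeeper-system` are assumptions about this particular install:

```shell
# Fail open with short timeouts so a slow/unreachable webhook cannot block
# node registration while the root cause is investigated.
helm upgrade gatekeeper gatekeeper/gatekeeper -n gatekeeper-system \
  --reuse-values \
  --set validatingWebhookFailurePolicy=Ignore \
  --set mutatingWebhookFailurePolicy=Ignore \
  --set validatingWebhookTimeoutSeconds=3 \
  --set mutatingWebhookTimeoutSeconds=3
```

The trade-off is that policy violations can slip through while the webhook fails open, so this is a mitigation, not a fix.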
@darkstarmv Do you have any constraints on your cluster?
Since you are logging denies, do you see any denies on any G8r webhook pods that coincide with trying to spin up the master?
@darkstarmv Do you have any constraints on your cluster?
Since you are logging denies, do you see any denies on any G8r webhook pods that coincide with trying to spin up the master?
I have a couple of ingress constraints to block duplicate ingress host and localhost. I'll try to dig through logs some more
Might also be worth seeing if disabling the webhook temporarily solves the problem. Not 100% clear G8r is involved, only that it's installed on the cluster. If disabling the webhook doesn't clear the issue, it's likely not G8r.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.