
Gatekeeper Prevents New Master Node from Joining Cluster

Open kylegoch opened this issue 3 years ago • 15 comments

What steps did you take and what happened: We had been using Gatekeeper successfully for a few weeks without issue. However, the other day we had to replace a K8s master node, and Gatekeeper actually kept the master node from joining the cluster. After running kubectl delete validatingwebhookconfigurations.admissionregistration.k8s.io gatekeeper-validating-webhook-configuration, the node was able to join.

We discovered we had not told Gatekeeper to ignore the kube-system namespace. So we updated our config, then intentionally replaced a master. Same issue: the master would not join. Worker nodes have no issues joining.
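A minimal sketch of the Config-based exclusion that is often tried here, assuming the installed version's config.gatekeeper.sh Config resource supports process-level matching:

apiVersion: config.gatekeeper.sh/v1alpha1
kind: Config
metadata:
  name: config
  namespace: gatekeeper-system
spec:
  match:
    - excludedNamespaces: ["kube-system"]
      processes: ["*"]

Note that this exclusion is evaluated inside Gatekeeper, so the API server still calls the webhook for kube-system requests; only the namespaceSelector-based exemption (the ignore label plus the --exempt-namespace flag, discussed further down the thread) stops the call entirely, which matters when the webhook endpoint itself is unreachable.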

The kubelet logs on the master trying to join were not overly helpful. There was an error about not being able to reach the Gatekeeper endpoint, but there were other endpoint errors as well.

Install method: kubectl apply -f https://raw.githubusercontent.com/open-policy-agent/gatekeeper/master/deploy/gatekeeper.yaml

What did you expect to happen: Gatekeeper to run without interfering with master nodes.

Environment:

  • Gatekeeper version: v3.1.0-beta.9
  • Kubernetes version: 1.17.9

kylegoch avatar Aug 17 '20 16:08 kylegoch

This is somewhat related to a feature request we have for validating webhooks in Kubernetes:

https://github.com/kubernetes/kubernetes/issues/92157

Was the Gatekeeper pod running at the time you upgraded the master node, or was it taken down as part of the upgrade?

What specific resource was it trying to create when this happened?

maxsmythe avatar Aug 18 '20 00:08 maxsmythe

In the short term you can apply the ignore label to the kube-system namespace: https://github.com/open-policy-agent/gatekeeper#exempting-namespaces-from-the-gatekeeper-admission-webhook-using---exempt-namespace-flag
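A minimal sketch of that approach, assuming a default install in the gatekeeper-system namespace (the flag and label names come from the linked README):

# Add the flag to the gatekeeper-controller-manager container args
# (or the equivalent Helm value) so Gatekeeper permits the exemption:
#   --exempt-namespace=kube-system
# Then label the namespace so the webhook's namespaceSelector skips it;
# only the key is checked by the selector, so the value is arbitrary:
kubectl label namespace kube-system admission.gatekeeper.sh/ignore=true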

maxsmythe avatar Aug 18 '20 00:08 maxsmythe

Was the Gatekeeper pod running at the time you upgraded the master node, or was it taken down as part of the upgrade?

It was running during the replacement of the master.

What specific resource was it trying to create when this happened?

It appears Rancher was trying to create/update a ServiceAccount and/or ClusterRoleBinding.

In the short term you can apply the ignore label to the kube-system namespace

I actually had done this, but I did not add the --exempt-namespace flag. I went back and added that and it still didn't work. Then I went through and exempted/labeled all the Rancher namespaces. Still no luck. I then upgraded to v3.1.0-rc.1 and still no luck there either.

Again, running the emergency removal command allowed the test master I had created to join the cluster. There might be some other namespace Rancher uses, but I have all the ones I know of ignored at the moment.

kylegoch avatar Aug 19 '20 13:08 kylegoch

It appears Rancher was trying to create/update a ServiceAccount and/or ClusterRoleBinding.

Do you see any errors/events generated due to failure to create or update these resources? Did you have any constraints in the cluster at the time of this failure? If so, can you please share them?

One thing to note is that ClusterRoleBinding is not a namespaced resource, so it's possible that exempting namespaces has no effect. From the Kubernetes docs:

The namespaceSelector decides whether to run the webhook on a request for a namespaced resource (or a Namespace object), based on whether the namespace's labels match the selector. If the object itself is a namespace, the matching is performed on object.metadata.labels. If the object is a cluster scoped resource other than a Namespace, namespaceSelector has no effect.
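To illustrate, an abbreviated sketch of the namespace filtering on Gatekeeper's validating webhook as deployed by the default manifests (field values assumed); a ClusterRoleBinding has no namespace, so this selector never excludes it:

apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: gatekeeper-validating-webhook-configuration
webhooks:
  - name: validation.gatekeeper.sh
    # Requests from labeled namespaces are skipped; cluster-scoped
    # objects have no namespace and are always sent to the webhook.
    namespaceSelector:
      matchExpressions:
        - key: admission.gatekeeper.sh/ignore
          operator: DoesNotExist
    rules:
      - apiGroups: ["*"]
        apiVersions: ["*"]
        operations: ["CREATE", "UPDATE"]
        resources: ["*"]
    # clientConfig, failurePolicy, timeoutSeconds omitted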

ritazh avatar Aug 19 '20 15:08 ritazh

+1 to the possibility of some cluster-scoped resource causing issues.

Do you know if the validating webhook was affirmatively rejecting requests, or is this rejection due to timeouts?

Did the ValidatingWebhookConfiguration have a timeout set? Was it set to fail open?
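For anyone debugging this, both settings are visible on the live object, for example:

kubectl get validatingwebhookconfiguration gatekeeper-validating-webhook-configuration \
  -o jsonpath='{range .webhooks[*]}{.name}{"\t"}{.timeoutSeconds}{"\t"}{.failurePolicy}{"\n"}{end}'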

maxsmythe avatar Aug 19 '20 22:08 maxsmythe

What is the solution to this? Is there a workaround that can be used for now?

gravufo avatar May 28 '21 18:05 gravufo

Unfortunately we don't have enough information right now.

  • Is this a general problem for all clusters, or just Rancher?
  • What resource requests are leading to problems calling the webhook? GVK, Namespace, Name?
  • Why is the webhook unable to be reached even when the webhook pods are running?
  • Is the webhook configured to fail open and with a short enough timeout?
  • Follow-up questions, depending on the answers to the above

maxsmythe avatar Jun 02 '21 15:06 maxsmythe

We had a similar issue on GKE: the webhook timeout prevented any node from joining the cluster. (This cluster was on preemptible instances, so at some point all nodes may have been removed.)

After deleting both webhooks (gatekeeper-mutating-webhook-configuration and gatekeeper-validating-webhook-configuration), the cluster recovered. The cluster is on 1.19.13-gke.1900 for the control plane and nodes. This also caused GKE to trigger node pool auto-repair, but it was unable to self-repair.

Errors from kubelet:

"https://{gke-ip}/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/{node-id}?timeout=10s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)

Unable to register node "{node-id}" with API server: Post "https://{gke-ip}/api/v1/nodes": read tcp 10.128.0.37:60220->{gke-ip}:443: use of closed network connection

It also causes: Unable to update cni config: no networks found in /etc/cni/net.d
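On a managed control plane like GKE, where the API server logs are not directly accessible, the admission metrics exposed by the API server can show after the fact whether webhook calls were being rejected or timing out (metric names as found in recent Kubernetes releases; they may differ on older versions):

kubectl get --raw /metrics | grep -E 'apiserver_admission_webhook_(rejection_count|admission_duration_seconds_count)'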

Gentoli avatar Oct 29 '21 05:10 Gentoli

This is interesting: that call lists a timeout of 10 seconds. Gatekeeper's validation webhook has a timeout of 2 seconds (though it was higher previously). What is your current validating webhook timeout? Are you running the mutation webhook?

maxsmythe avatar Oct 30 '21 02:10 maxsmythe

We have v3.4.0; the mutating and validating webhooks had timeouts of 30 and 3 seconds, respectively.

Gentoli avatar Oct 31 '21 05:10 Gentoli

I have experienced the same problem with master nodes in a BYO K8s cluster with Gatekeeper v3.7.0.

  • I did not have any issues with adding/removing worker nodes; only master nodes fail to register.

Setup

  • Gatekeeper pods are up and running
  • there are 3 masters running, which were set up before Gatekeeper was deployed
  • a new master node is added and it fails to register

Logs

The kubelet log on the master node shows a lot of errors like these:

Jan 18 21:24:07 master-node-name kubelet[24315]: E0118 21:24:07.559174   24315 kubelet.go:2212] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: kubenet does not have netConfig. This is most likely due to lack of PodCIDR
Jan 18 21:24:11 master-node-name kubelet[24315]: E0118 21:24:11.349627   24315 kubelet.go:2292] node "master-node-name" not found
Jan 18 21:24:11 master-node-name kubelet[24315]: E0118 21:24:11.449800   24315 kubelet.go:2292] node "master-node-name" not found
Jan 18 21:24:11 master-node-name kubelet[24315]: E0118 21:24:11.550034   24315 kubelet.go:2292] node "master-node-name" not found
Jan 18 21:24:11 master-node-name kubelet[24315]: E0118 21:24:11.650209   24315 kubelet.go:2292] node "master-node-name" not found
Jan 18 21:24:11 master-node-name kubelet[24315]: I0118 21:24:11.655041   24315 nodeinfomanager.go:403] Failed to publish CSINode: nodes "master-node-name" not found
Jan 18 21:24:11 master-node-name kubelet[24315]: I0118 21:24:11.670474   24315 nodeinfomanager.go:403] Failed to publish CSINode: nodes "master-node-name" not found
Jan 18 21:24:11 master-node-name kubelet[24315]: I0118 21:24:11.731168   24315 nodeinfomanager.go:403] Failed to publish CSINode: nodes "master-node-name" not found
Jan 18 21:24:11 master-node-name kubelet[24315]: E0118 21:24:11.750440   24315 kubelet.go:2292] node "master-node-name" not found
Jan 18 21:24:11 master-node-name kubelet[24315]: E0118 21:24:11.850677   24315 kubelet.go:2292] node "master-node-name" not found
Jan 18 21:24:11 master-node-name kubelet[24315]: E0118 21:24:11.950879   24315 kubelet.go:2292] node "master-node-name" not found
Jan 18 21:24:11 master-node-name kubelet[24315]: I0118 21:24:11.991486   24315 nodeinfomanager.go:403] Failed to publish CSINode: nodes "master-node-name" not found
Jan 18 21:24:11 master-node-name kubelet[24315]: E0118 21:24:11.991519   24315 csi_plugin.go:285] Failed to initialize CSINode: error updating CSINode annotation: timed out waiting for the condition; caused by: nodes "master-node-name" not found
Jan 18 21:24:11 master-node-name kubelet[24315]: F0118 21:24:11.991525   24315 csi_plugin.go:299] Failed to initialize CSINode after retrying: timed out waiting for the condition
Jan 18 21:24:12 master-node-name systemd[1]: kubelet.service: main process exited, code=exited, status=255/n/a

Gatekeeper helm config:

replicas: 3
auditInterval: 300
auditMatchKindOnly: true
constraintViolationsLimit: 20
auditFromCache: false
disableMutation: false
disableValidatingWebhook: false
validatingWebhookTimeoutSeconds: 30
validatingWebhookFailurePolicy: Fail
validatingWebhookCheckIgnoreFailurePolicy: Fail
enableDeleteOperations: false
enableExternalData: false
mutatingWebhookFailurePolicy: Fail
mutatingWebhookTimeoutSeconds: 30
auditChunkSize: 100
logLevel: INFO
logDenies: true
emitAdmissionEvents: true
emitAuditEvents: true
resourceQuota: true
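One thing that stands out in these values: both webhooks use failurePolicy: Fail with 30-second timeouts, so any window in which the webhook service is unreachable rejects the matching API requests outright. A minimal sketch of relaxing that while debugging, assuming the upstream chart and a release named gatekeeper (adjust to the actual install):

helm upgrade gatekeeper gatekeeper/gatekeeper -n gatekeeper-system \
  --set validatingWebhookFailurePolicy=Ignore \
  --set validatingWebhookTimeoutSeconds=3 \
  --set mutatingWebhookFailurePolicy=Ignore \
  --set mutatingWebhookTimeoutSeconds=3

Failing open trades enforcement for availability, so this is a stopgap for diagnosis rather than a fix.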

darkstarmv avatar Jan 18 '22 22:01 darkstarmv

@darkstarmv Do you have any constraints on your cluster?

Since you are logging denies, do you see any denies on any G8r webhook pods that coincide with trying to spin up the master?
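One way to check, assuming the default gatekeeper-system namespace and deployment name (the exact text of a deny log line can vary by version):

# With multiple replicas, repeat per pod; kubectl logs on a deployment
# only streams from one of them.
kubectl logs -n gatekeeper-system deployment/gatekeeper-controller-manager --since=1h | grep -i denied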

maxsmythe avatar Jan 19 '22 02:01 maxsmythe

@darkstarmv Do you have any constraints on your cluster?

Since you are logging denies, do you see any denies on any G8r webhook pods that coincide with trying to spin up the master?

I have a couple of ingress constraints to block duplicate ingress hosts and localhost. I'll try to dig through the logs some more.

darkstarmv avatar Feb 04 '22 14:02 darkstarmv

It might also be worth seeing if temporarily disabling the webhook solves the problem. It's not 100% clear that G8r is involved, only that it's installed on the cluster. If disabling the webhook doesn't clear the issue, it's likely not G8r.
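A sketch of that temporary disable using the chart value shown earlier in the thread (release and chart names assumed, as before):

helm upgrade gatekeeper gatekeeper/gatekeeper -n gatekeeper-system \
  --set disableValidatingWebhook=true

Deleting the ValidatingWebhookConfiguration object, as done earlier in the thread, achieves the same effect without touching the Helm release.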

maxsmythe avatar Feb 05 '22 00:02 maxsmythe

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Jul 23 '22 02:07 stale[bot]

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Oct 11 '22 05:10 stale[bot]