OPA Gatekeeper brings the cluster to an inoperable state after upgrading/restarting nodes
What steps did you take and what happened:
We adopted OPA Gatekeeper in our company and initially used only its validating webhook feature, because we only wanted to enforce some organizational policies across our Kubernetes clusters. We had no issues until we decided to use the mutating webhook feature that shipped with version 3.4.0. We upgraded our OPA Gatekeeper instances to 3.4.0 and enabled the mutating webhook, but we did not apply any mutation CRDs. The clusters contain nothing but a few ConstraintTemplates; we have not applied any Constraints either.
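For context, the admission webhook configurations Gatekeeper registers can be listed as follows (the resource name in the second command is the default from the manifests/Helm chart and may differ in a customized install):
$ kubectl get validatingwebhookconfigurations,mutatingwebhookconfigurations
$ kubectl get mutatingwebhookconfiguration gatekeeper-mutating-webhook-configuration -o yaml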
As soon as we upgraded, our SRE teams started to report issues with OPA Gatekeeper. They told us that once they removed Gatekeeper's MutatingWebhookConfiguration/ValidatingWebhookConfiguration resources, everything went back to normal (see the commands after the list below). Until they removed those resources, they hit issues like the following:
- They couldn't see the logs of any Pod in the cluster.
- They couldn't create any Deployment, Pod, etc.
- They couldn't communicate with the kube-apiserver.
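Roughly, the mitigation was to delete those webhook configurations (the resource names below are the Gatekeeper defaults; adjust them if your install renames them):
$ kubectl delete validatingwebhookconfiguration gatekeeper-validating-webhook-configuration
$ kubectl delete mutatingwebhookconfiguration gatekeeper-mutating-webhook-configuration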
We ran several tests on different Kubernetes versions to reproduce the issue:
Test 1
Environment
- Kubernetes v1.16.1
- OPA Gatekeeper v3.4.0
Description
We restarted all the nodes in the cluster with the following command and watched the cluster state to see what would happen:
$ ansible all -m ansible.builtin.command -a "reboot"
As a result:
- Once all the nodes came back, we couldn't reach the API server even though all the API server pods were up and running.
- We couldn't create Pods due to timeout errors.
- All the OPA Gatekeeper replicas went into CrashLoopBackOff.
- We couldn't fetch pod logs; we got an error like this:
Error from server (InternalError): Internal error occurred: Authorization Error (user=kube-apiserver-kubelet-client, verb=get, resource=nodes, subresource=proxy
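When the cluster is in this state, a quick check (assuming the default gatekeeper-system namespace and gatekeeper-webhook-service service name) is whether the webhook service still has any ready endpoints backing it:
$ kubectl -n gatekeeper-system get endpoints gatekeeper-webhook-service
$ kubectl -n gatekeeper-system get pods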
Test 2
Environment
- Kubernetes v1.16.11
- OPA Gatekeeper v3.4.0
Description
By default, the mutating webhook intercepts requests for all resource types, because the resources field in the rules section is set to '*', like the following:
rules:
- apiGroups:
- '*'
apiVersions:
- '*'
operations:
- CREATE
- UPDATE
resources:
- '*'
So in the second scenario, we narrowed the scope of the mutating webhook by setting the resources field to pods, which means it only intercepts CREATE and UPDATE requests for Pod resources, and then we restarted all the nodes.
rules:
- apiGroups:
- '*'
apiVersions:
- '*'
operations:
- CREATE
- UPDATE
resources:
- 'pods'
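One way to apply that narrowing in place is a JSON patch against the existing configuration (the configuration name is the Gatekeeper default and the rule index is an assumption; adjust both for your install):
$ kubectl patch mutatingwebhookconfiguration gatekeeper-mutating-webhook-configuration \
    --type=json \
    -p '[{"op": "replace", "path": "/webhooks/0/rules/0/resources", "value": ["pods"]}]'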
As a result:
- All the OPA Gatekeeper replicas worked as expected; they restarted a couple of times but did not go into CrashLoopBackOff.
- We were able to create Pods, fetch logs, and so on.
Test 3
Environment
- Kubernetes v1.21.2
- OPA Gatekeeper v3.5.1
Description
This time we installed OPA Gatekeeper using its Helm chart; otherwise we changed nothing.
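For reference, a standard chart install looks roughly like the following (repo URL as documented upstream; our chart values are omitted here):
$ helm repo add gatekeeper https://open-policy-agent.github.io/gatekeeper/charts
$ helm install gatekeeper gatekeeper/gatekeeper --namespace gatekeeper-system --create-namespace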
We restarted all the nodes in the cluster with the following command and watched the cluster state to see what would happen:
$ ansible all -m ansible.builtin.command -a "reboot"
As a result:
- Once all the nodes came back, we couldn't reach the API server even though all the API server pods were up and running.
- The calico-node replicas went into an Unknown state.
- The OPA Gatekeeper replicas also went into an Unknown state, because the calico nodes did.
These are some of the errors we saw while inspecting one of the calico pods:
calico-node-x82mp calico-node 2021-07-28 09:37:14.825 [INFO][58] felix/route_table.go 1096: Failed to access interface because it doesn't exist. error=Link not found ifaceName="cali7455238789c" ifaceRegex="^cali.*" ipVersion=0x4
ERROR: Error accessing the Calico datastore: context deadline exceeded
Warning Unhealthy 21m kubelet Liveness probe failed: calico/node is not ready: Felix is not live: Get "http://localhost:9099/liveness": dial tcp 127.0.0.1:9099: connect: connection refused
Warning Unhealthy 21m (x3 over 21m) kubelet Readiness probe failed: calico/node is not ready: BIRD is not ready: Error querying BIRD: unable to connect to BIRDv4 socket: dial unix /var/run/bird/bird.ctl: connect: no such file or directory
Once we removed Gatekeeper's ValidatingWebhookConfiguration resource, everything started to work again.
What did you expect to happen:
We expected everything to keep working fine.
cc: @dentrax @f9n @necatican
Thank you for the detailed writeup!
One thing I want to highlight: please be careful if you are using mutation in production. It's an alpha feature and may not be stable, so only use it if/where you are okay with that caveat.
From my reading of your bug, am I right in assuming:
- Everything works fine if only validation is enabled
- Everything works fine if only mutation is enabled
- Things break if both are enabled
?
Since the failurePolicy for both webhooks is Ignore by default, it's a bit confusing that there would be any issues. I wonder if this is due to leader election timeouts.
If you reduce timeoutSeconds to 1 for both the validating and mutating webhook configurations, does the problem go away?
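For reference, one way to try that (the configuration names are the Gatekeeper defaults and the webhook index is an assumption; adjust both for your install):
$ kubectl patch validatingwebhookconfiguration gatekeeper-validating-webhook-configuration \
    --type=json -p '[{"op": "replace", "path": "/webhooks/0/timeoutSeconds", "value": 1}]'
$ kubectl patch mutatingwebhookconfiguration gatekeeper-mutating-webhook-configuration \
    --type=json -p '[{"op": "replace", "path": "/webhooks/0/timeoutSeconds", "value": 1}]'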
Assuming this works and the problem is mitigated... the second question would be: why is GK falling over in the first place? It could be a scaling issue or a bootstrapping issue. Are you downing all the nodes in such a way that there are no GK pods running? If so, this may be a bootstrapping issue (API server can't call Gatekeeper b/c it's not running, GK isn't running because there are no nodes).
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.
/remove-lifecycle stale
Is this still ongoing? I haven't seen any follow-up from my original response.