
OPA Gatekeeper brings the cluster to an inoperable state after upgrading/restarting nodes


What steps did you take and what happened: First of all, we decided to use OPA Gatekeeper in our company and started with only its Validating webhook feature, because we only want to enforce some organizational policies across our Kubernetes clusters. We had no issues until we decided to use the Mutating webhook feature of OPA Gatekeeper that came with version 3.4.0. We upgraded our OPA Gatekeeper instances to 3.4.0 and enabled the Mutating webhook feature, but we did not apply any CRDs for mutations. We also have nothing but some ConstraintTemplates, and we did not apply any Constraints to the clusters.

As soon as we upgraded, our SRE teams started to report issues with OPA Gatekeeper. However, they told us that once they removed Gatekeeper's Mutating/ValidatingWebhookConfiguration resources, everything went back to normal.

They had issues like the following until they removed those webhook configurations:

  • They couldn't see the logs of any Pod in the cluster.
  • They couldn't create any Deployment, Pod, etc.
  • They couldn't communicate with the kube-apiserver.

We ran several tests on different Kubernetes versions to reproduce the same issue:

Test 1

Environment

  • Kubernetes v1.16.1
  • OPA Gatekeeper v3.4.0

Description

We restarted all the nodes belonging to the cluster with the following command and watched the state of the cluster to see what would happen:

$ ansible all -m ansible.builtin.command -a "reboot" 
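(Here, "watched the state of the cluster" just means polling nodes and pods with kubectl; the commands below are illustrative rather than the precise ones we ran:)

$ kubectl get nodes -o wide
$ kubectl get pods --all-namespaces -o wide --watch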

As a result:

  • Once all the nodes came back up, we couldn't reach the API server even though all the API server pods were up and running.
  • We couldn't create Pods due to timeout errors.
  • All the replicas of OPA Gatekeeper went into the CrashLoopBackOff state.
  • We couldn't fetch the logs of any pods; we got an error like this:
Error from server (InternalError): Internal error occurred: Authorization Error (user=kube-apiserver-kubelet-client, verb=get, resource=nodes, subresource=proxy


Test 2

Environment

  • Kubernetes v1.16.11
  • OPA Gatekeeper v3.4.0

Description

By default, the Mutating Webhook intercepts requests for every resource type, because the resources field within the rules section is set to '*', like the following:

  rules:
  - apiGroups:
    - '*'
    apiVersions:
    - '*'
    operations:
    - CREATE
    - UPDATE
    resources:
    - '*'

So in the second scenario, we narrowed down the scope of the Mutating Webhook by setting the resources field to pods, which means it will only intercept CREATE and UPDATE requests for resources of type Pod, and then we restarted all the nodes.

  rules:
  - apiGroups:
    - '*'
    apiVersions:
    - '*'
    operations:
    - CREATE
    - UPDATE
    resources:
    - 'pods'
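(One way to make and verify this change, assuming the default configuration name created by the Gatekeeper install, is to edit the MutatingWebhookConfiguration in place and then print which resources each webhook still intercepts:)

$ kubectl edit mutatingwebhookconfiguration gatekeeper-mutating-webhook-configuration
$ kubectl get mutatingwebhookconfiguration gatekeeper-mutating-webhook-configuration \
    -o jsonpath='{range .webhooks[*]}{.name}{": "}{.rules}{"\n"}{end}'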

As a result:

  • All the replicas of OPA Gatekeeper worked as expected; they only restarted a couple of times and did not go into the CrashLoopBackOff state.
  • We were able to create Pods, fetch logs, etc.


Test 3

Environment

  • Kubernetes v1.21.2
  • OPA Gatekeeper v3.5.1

Description

This time we installed OPA Gatekeeper using its Helm chart. We did nothing different beyond that.

We restarted all the nodes belonging to the cluster with the following command and watched the state of the cluster to see what would happen:

$ ansible all -m ansible.builtin.command -a "reboot"

As a result:

  • Once all the nodes came back up, we couldn't reach the API server even though all the API server pods were up and running.
  • The replicas of calico-node went into an Unknown state.
  • OPA Gatekeeper's replicas went into an Unknown state because the calico nodes were in an Unknown state.

These are some of the errors we saw while inspecting one of the calico pods:

calico-node-x82mp calico-node 2021-07-28 09:37:14.825 [INFO][58] felix/route_table.go 1096: Failed to access interface because it doesn't exist. error=Link not found ifaceName="cali7455238789c" ifaceRegex="^cali.*" ipVersion=0x4

ERROR: Error accessing the Calico datastore: context deadline exceeded

Warning  Unhealthy       21m                kubelet  Liveness probe failed: calico/node is not ready: Felix is not live: Get "http://localhost:9099/liveness": dial tcp 127.0.0.1:9099: connect: connection refused

Warning  Unhealthy       21m (x3 over 21m)  kubelet  Readiness probe failed: calico/node is not ready: BIRD is not ready: Error querying BIRD: unable to connect to BIRDv4 socket: dial unix /var/run/bird/bird.ctl: connect: no such file or directory

Once we removed OPA Gatekeeper's ValidatingWebhookConfiguration resource, everything started to work fine again.
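(For reference, removing the webhook configurations comes down to deleting two cluster-scoped objects. The names below are the defaults created by the Gatekeeper manifests/Helm chart and may differ in customized installs; deleting them disables all Gatekeeper admission enforcement until they are re-applied:)

$ kubectl get validatingwebhookconfigurations,mutatingwebhookconfigurations
$ kubectl delete validatingwebhookconfiguration gatekeeper-validating-webhook-configuration
$ kubectl delete mutatingwebhookconfiguration gatekeeper-mutating-webhook-configuration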

What did you expect to happen:

We expected everything to keep working fine after the upgrade and the node restarts.

developer-guy avatar Aug 03 '21 10:08 developer-guy

cc: @dentrax @f9n @necatican

developer-guy avatar Aug 03 '21 10:08 developer-guy

Thank you for the detailed writeup!

One thing I want to highlight: please be careful if you are using mutation in production. It's an alpha feature and may not be stable, so only use it if/where you are okay with that caveat.

From my reading of your bug, am I right in assuming:

  • Everything works fine if only validation is enabled
  • Everything works fine if only mutation is enabled
  • Things break if both are enabled

?

Since the failurePolicy for both webhooks is Ignore by default, it's a bit confusing that there would be any issues. I wonder if this is due to leader election timeouts.

If you reduce timeoutSeconds to 1 for both the validating and mutating webhook configurations, does the problem go away?
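Something like this should do it, assuming the default configuration names installed by the chart (if a configuration contains more than one webhook entry, repeat the patch for each index):

$ kubectl patch validatingwebhookconfiguration gatekeeper-validating-webhook-configuration \
    --type=json -p='[{"op": "replace", "path": "/webhooks/0/timeoutSeconds", "value": 1}]'
$ kubectl patch mutatingwebhookconfiguration gatekeeper-mutating-webhook-configuration \
    --type=json -p='[{"op": "replace", "path": "/webhooks/0/timeoutSeconds", "value": 1}]'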

Assuming this works and the problem is mitigated... the second question would be: why is GK falling over in the first place? It could be a scaling issue or a bootstrapping issue. Are you downing all the nodes in such a way that there are no GK pods running? If so, this may be a bootstrapping issue (API server can't call Gatekeeper b/c it's not running, GK isn't running because there are no nodes).

maxsmythe avatar Aug 03 '21 20:08 maxsmythe

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Jul 23 '22 06:07 stale[bot]

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Oct 11 '22 02:10 stale[bot]

/remove-lifecycle stale

Dentrax avatar Apr 28 '23 12:04 Dentrax

Is this still ongoing? I haven't seen any follow-up from my original response.

maxsmythe avatar May 04 '23 01:05 maxsmythe