Increase Gatekeeper Validation Webhook Timeout
Describe the solution you'd like: The Gatekeeper validating-webhook timeout was previously lowered from 5 seconds to 3 seconds due to a potential leader-election issue; see #870. Project Ratify implements an external data provider that interacts with Gatekeeper and intermittently times out against the 3-second limit. We'd like to propose increasing the default back to the original 5 seconds (or even higher).
Anything else you would like to add: The original issue was a result of Kubernetes' transition from using Endpoints/ConfigMaps to using Leases for leader election. The official migration plan is outlined here: https://github.com/kubernetes/kubernetes/issues/80289. Starting in version 1.17, Kubernetes used EndpointsLease as the default resource type for leader election. This required two sequential resource updates in the leader-renewal process. Since the validating-webhook timeout was 5 seconds, it was possible for the kube-controller-manager (CM) or kube-scheduler (SCHE) to give up before the API server responded (the leader-renewal timeout is 10 seconds by default).
Since version 1.20, Kubernetes has made the Lease resource the default. See the CM and SCHE documentation for the --leader-elect-resource-lock flag here and here. This is also confirmed in the default config files for the CM and SCHE here and here.
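For reference, the lock type is visible in the control-plane manifests. A sketch of a typical kubeadm layout follows (the file path and surrounding fields are assumptions for illustration, not taken from this thread):

```yaml
# Excerpt sketch of /etc/kubernetes/manifests/kube-controller-manager.yaml on a
# kubeadm cluster (path is an assumption). On >= 1.20 the flag may be omitted
# entirely, since "leases" is already the default lock type.
spec:
  containers:
    - name: kube-controller-manager
      command:
        - kube-controller-manager
        - --leader-elect=true
        - --leader-elect-resource-lock=leases   # default since 1.20
```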
Now that the Lease resource is the default, only one update needs to be made in the renewal process which should mitigate the previous issue. Can the default be changed back? Are there any other concerns to consider now that the original problem seems to be mitigated?
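Concretely, the proposal amounts to changing one field on the webhook configuration. A sketch (the object and webhook names shown are the ones Gatekeeper typically registers, included here as assumptions):

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: gatekeeper-validating-webhook-configuration
webhooks:
  - name: validation.gatekeeper.sh
    # timeoutSeconds is the standard admissionregistration.k8s.io field (1-30s);
    # the proposal is to restore 5 here instead of the current 3.
    timeoutSeconds: 5
```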
Environment:
- Gatekeeper version:
- Kubernetes version (use kubectl version): > 1.20
I'm not opposed to raising the default timeout if it doesn't interfere with cluster operations.
My biggest concern would be making sure the long tail of users is mostly off the old behavior before changing the defaults.
From your description, this would mean everyone being on k8s version >= 1.20. Also, leader election is a general-purpose library; would we need to worry about interfering with other projects?
Mutation would still require a lower timeout, as it is subject to reinvocation policy.
Thanks @akashsinghal for sharing the changes in v1.20. If we do decide to update this, v1.19 is already EOL. If we are worried about backward compatibility, we can make the default values conditional on the Kubernetes version in the Helm chart.
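A hypothetical sketch of that version-conditional default in a Helm template, using Helm's built-in .Capabilities.KubeVersion and the semverCompare function (the value name validatingWebhookTimeoutSeconds is an assumption, not necessarily the chart's actual key):

```yaml
# Fragment of a webhook template (hypothetical): keep the conservative 3s
# default on pre-1.20 clusters, use 5s otherwise, unless the user overrides.
{{- if semverCompare ">=1.20-0" .Capabilities.KubeVersion.Version }}
    timeoutSeconds: {{ .Values.validatingWebhookTimeoutSeconds | default 5 }}
{{- else }}
    timeoutSeconds: {{ .Values.validatingWebhookTimeoutSeconds | default 3 }}
{{- end }}
```

The "-0" suffix in the comparison makes pre-release cluster versions (e.g. cloud-vendor suffixes) match the range as expected.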
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.
Recently the timeout was just lowered to 1 second: https://github.com/open-policy-agent/gatekeeper/pull/1913
I'm actually seeing very frequent timeouts now with the 1-second timeout.
This is a pretty big cluster (700+ pods), but metrics on gatekeeper look good, it has plenty of resources, no CPU limit, 3 replicas, etc.
I believe the leader-election issue shouldn't be solved by messing with the timeout, since that's quite hacky IMO. People concerned about that issue should instead configure the webhooks to not intercept leader-election resources, using validatingWebhookCustomRules/mutatingWebhookCustomRules: https://github.com/open-policy-agent/gatekeeper/pull/1806
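A sketch of that approach, assuming the chart value replaces the webhook's default catch-all rule (the exact value shape may differ; see the PR). Since admission rules are match-lists and cannot express exclusions directly, the idea is to enumerate the API groups you do want and omit coordination.k8s.io, so Lease renewals bypass the webhook entirely:

```yaml
# values.yaml sketch (group list is illustrative, not exhaustive): omitting
# coordination.k8s.io keeps leader-election Lease updates out of the webhook path.
validatingWebhookCustomRules:
  rules:
    - apiGroups: ["", "apps", "batch", "networking.k8s.io", "rbac.authorization.k8s.io"]
      apiVersions: ["*"]
      operations: ["CREATE", "UPDATE"]
      resources: ["*"]
```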