
Installation with Helm hangs until reaching the timeout

Open skattaa opened this issue 3 years ago • 18 comments

What steps did you take and what happened: Good day, I have tried to deploy Gatekeeper using the Helm chart with: helm install -n gatekeeper-system gatekeeper gatekeeper/gatekeeper. The install keeps hanging, helm status shows "pending-install", and it eventually fails after reaching the timeout.

What did you expect to happen: gatekeeper is installed and running

Anything else you would like to add: With the --debug flag, the install appears to get stuck where a job.batch starts. This job creates a pod "gatekeeper-update-namespace-label--1-z74vh" which adds labels to the gatekeeper namespace. The pod keeps starting and then goes into an error status. Its logs show the error below:

I0302 08:46:35.941139       1 request.go:668] Waited for 1.192226109s due to client-side throttling, not priority and fairness, request: GET:https://10.43.0.1:443/apis/cert-manager.io/v1beta1?timeout=32s
Error from server (InternalError): Internal error occurred: failed calling webhook "check-ignore-label.gatekeeper.sh": Post "https://gatekeeper-webhook-service.gatekeeper-system.svc:443/v1/admitlabel?timeout=3s": context deadline exceeded
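
For reference, the failing hook pod and its logs can be pulled with something like the following (the generated pod name suffix differs per run; this assumes the default gatekeeper-system namespace):

kubectl -n gatekeeper-system get pods
kubectl -n gatekeeper-system logs job/gatekeeper-update-namespace-label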

This seems to indicate a problem related to "API Priority and Fairness". The installation is fresh; no older version was installed.

Interestingly, the installation method from the official documentation:

kubectl apply -f https://raw.githubusercontent.com/open-policy-agent/gatekeeper/release-3.7/deploy/gatekeeper.yaml never uses the jobs.batch "gatekeeper-update-namespace-label" that caused the problem with the Helm installation. Kubernetes is based on rke2, version 1.22.3.

Environment:

  • Gatekeeper version: 3.7
  • Kubernetes version: (use kubectl version): v1.22.3+rke2r1
  • Kubectl version: 1.22.4
  • Helm version: 3.6

skattaa avatar Mar 02 '22 09:03 skattaa

We've seen the same behavior with k3s 1.21.4, gatekeeper 3.7.0, and helm 3.6.

mes5k avatar Apr 13 '22 23:04 mes5k

Any idea?

skattaa avatar May 10 '22 06:05 skattaa

I have this problem too on the CD solution we use, which only has Helm v3.1.2. It doesn't happen when I use Helm v3.8.0 in a different environment, but it looks like the cause is different from yours.

I have this in the cluster's event logs. From what I can see, Helm creates the job before it creates the service account, so the job never starts and Helm eventually times out. So we can just blame the old Helm version for this.

48s Warning FailedCreate job/gatekeeper-update-namespace-label Error creating: pods "gatekeeper-update-namespace-label-" is forbidden: error looking up service account gatekeeper/gatekeeper-update-namespace-label: serviceaccount "gatekeeper-update-namespace-label" not found

A workaround I can think of is to disable that post-install update task (the chart's values have an option for it) and apply the label manually via CLI/IaC or whatever you have. I haven't tested it yet though.
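
Roughly, something along these lines (untested, as noted above; the value name comes from the chart's postInstall.labelNamespace.enabled option, and the label is the one the hook would otherwise apply):

helm install -n gatekeeper-system gatekeeper gatekeeper/gatekeeper \
  --create-namespace \
  --set postInstall.labelNamespace.enabled=false
kubectl label ns gatekeeper-system admission.gatekeeper.sh/ignore=no-self-managing --overwrite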

alperb-icp avatar May 24 '22 12:05 alperb-icp

I am also facing the same issue while installing Gatekeeper via Helm v3.8.1.

mandeepgoyat avatar May 24 '22 16:05 mandeepgoyat

Same problem with Helm v3.0.1 when installing Gatekeeper using the Helm chart.

LijieZhou avatar Jun 08 '22 17:06 LijieZhou

Here are the steps I tried to repro the issue:

  1. Create a fresh 1.22.4 cluster with kind.
  2. Install helm 3.6.0
  3. helm install --create-namespace -n gatekeeper-system gatekeeper gatekeeper/gatekeeper --version 3.7.0 --debug
  4. Successfully installed gatekeeper.

I repeated the steps above with helm 3.7.0, 3.8.0, and 3.9.0 and I wasn't able to repro the issue.

When you said fresh install, did you make sure the gatekeeper-system namespace was deleted? It would be helpful to have the entire helm install debug log. I'd also recommend upgrading your Helm version.

Same problem with Helm v3.0.1 when installing Gatekeeper using the Helm chart.

I'd recommend upgrading your helm version since v3.0.1 is too old.

chewong avatar Jun 09 '22 17:06 chewong

@chewong In my experience this is an intermittent problem. At one point I was seeing failures maybe 1 in 10 times.

I've had some luck adding --timeout 10m to the helm install command.
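
For example, something like:

helm install -n gatekeeper-system gatekeeper gatekeeper/gatekeeper --create-namespace --timeout 10m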

At one point I got the impression that the update-namespace-label pod was starting, attempting its work, failing, and then restarting under Kubernetes' crash-loop backoff. Since the backoff delay grows with each restart, it doesn't take many failures before the next restart attempt lands after the timeout threshold.

mes5k avatar Jun 09 '22 18:06 mes5k

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Sep 18 '22 21:09 stale[bot]

Would be nice to see this resolved.

mes5k avatar Sep 18 '22 22:09 mes5k

dongillies@bl-mbp16-a3041:[a-leo]~$ helm version
version.BuildInfo{Version:"v3.9.4", GitCommit:"dbc6d8e20fe1d58d50e6ed30f09a04a77e4c68db", GitTreeState:"clean", GoVersion:"go1.19"}

kubernetes v1.21 (aws)

Command-line install fails with Gatekeeper v3.7.0, v3.8.0, and v3.9.0. Here is a log from v3.7.0.

kubectl delete namespace gatekeeper-system gatekeeper-policy-manager
kubectl delete crd -l gatekeeper.sh/system=yes
helm install -n gatekeeper-system --version v3.7.0  gatekeeper gatekeeper/gatekeeper --create-namespace --debug
[screenshot of the failed install output]
$ k logs gatekeeper-update-namespace-label-nsw4q -n gatekeeper-system
I0927 22:02:03.617574       1 request.go:668] Waited for 1.162965264s due to client-side throttling, not priority and fairness, request: GET:https://172.20.0.1:443/apis/templates.gatekeeper.sh/v1alpha1?timeout=32s
I0927 22:02:13.618160       1 request.go:668] Waited for 11.162924478s due to client-side throttling, not priority and fairness, request: GET:https://172.20.0.1:443/apis/cert-manager.io/v1?timeout=32s
Error from server (InternalError): Internal error occurred: failed calling webhook "check-ignore-label.gatekeeper.sh": Post "https://gatekeeper-webhook-service.gatekeeper-system.svc:443/v1/admitlabel?timeout=3s": Address is not allowed

$ helm install -n gatekeeper-system --version v3.7.0  gatekeeper gatekeeper/gatekeeper --create-namespace --debug
install.go:178: [debug] Original chart version: "v3.7.0"
install.go:195: [debug] CHART PATH: /Users/dongillies/Library/Caches/helm/repository/gatekeeper-3.7.0.tgz

client.go:128: [debug] creating 1 resource(s)
client.go:128: [debug] creating 1 resource(s)
client.go:128: [debug] creating 1 resource(s)
client.go:128: [debug] creating 1 resource(s)
client.go:128: [debug] creating 1 resource(s)
client.go:128: [debug] creating 1 resource(s)
client.go:128: [debug] creating 1 resource(s)
client.go:128: [debug] creating 1 resource(s)
client.go:128: [debug] creating 1 resource(s)
install.go:165: [debug] Clearing discovery cache
wait.go:48: [debug] beginning wait for 9 resources with timeout of 1m0s
W0927 14:59:31.729782   35184 warnings.go:70] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
client.go:128: [debug] creating 1 resource(s)
client.go:310: [debug] Starting delete for "gatekeeper-admin-upgrade-crds" ServiceAccount
client.go:339: [debug] serviceaccounts "gatekeeper-admin-upgrade-crds" not found
client.go:128: [debug] creating 1 resource(s)
client.go:310: [debug] Starting delete for "gatekeeper-admin-upgrade-crds" ClusterRole
client.go:339: [debug] clusterroles.rbac.authorization.k8s.io "gatekeeper-admin-upgrade-crds" not found
client.go:128: [debug] creating 1 resource(s)
client.go:310: [debug] Starting delete for "gatekeeper-admin-upgrade-crds" ClusterRoleBinding
client.go:339: [debug] clusterrolebindings.rbac.authorization.k8s.io "gatekeeper-admin-upgrade-crds" not found
client.go:128: [debug] creating 1 resource(s)
client.go:310: [debug] Starting delete for "gatekeeper-update-crds-hook" Job
client.go:339: [debug] jobs.batch "gatekeeper-update-crds-hook" not found
client.go:128: [debug] creating 1 resource(s)
client.go:540: [debug] Watching for changes to Job gatekeeper-update-crds-hook with timeout of 5m0s
client.go:568: [debug] Add/Modify event for gatekeeper-update-crds-hook: ADDED
client.go:607: [debug] gatekeeper-update-crds-hook: Jobs active: 0, jobs failed: 0, jobs succeeded: 0
client.go:568: [debug] Add/Modify event for gatekeeper-update-crds-hook: MODIFIED
client.go:607: [debug] gatekeeper-update-crds-hook: Jobs active: 1, jobs failed: 0, jobs succeeded: 0
client.go:568: [debug] Add/Modify event for gatekeeper-update-crds-hook: MODIFIED
client.go:310: [debug] Starting delete for "gatekeeper-admin-upgrade-crds" ServiceAccount
client.go:310: [debug] Starting delete for "gatekeeper-admin-upgrade-crds" ClusterRole
client.go:310: [debug] Starting delete for "gatekeeper-admin-upgrade-crds" ClusterRoleBinding
client.go:310: [debug] Starting delete for "gatekeeper-update-crds-hook" Job
client.go:128: [debug] creating 14 resource(s)
W0927 14:59:39.142965   35184 warnings.go:70] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
client.go:310: [debug] Starting delete for "gatekeeper-update-namespace-label" ServiceAccount
client.go:339: [debug] serviceaccounts "gatekeeper-update-namespace-label" not found
client.go:128: [debug] creating 1 resource(s)
client.go:310: [debug] Starting delete for "gatekeeper-update-namespace-label" Role
client.go:339: [debug] roles.rbac.authorization.k8s.io "gatekeeper-update-namespace-label" not found
client.go:128: [debug] creating 1 resource(s)
client.go:310: [debug] Starting delete for "gatekeeper-update-namespace-label" RoleBinding
client.go:339: [debug] rolebindings.rbac.authorization.k8s.io "gatekeeper-update-namespace-label" not found
client.go:128: [debug] creating 1 resource(s)
client.go:310: [debug] Starting delete for "gatekeeper-update-namespace-label" Job
client.go:339: [debug] jobs.batch "gatekeeper-update-namespace-label" not found
client.go:128: [debug] creating 1 resource(s)
client.go:540: [debug] Watching for changes to Job gatekeeper-update-namespace-label with timeout of 5m0s
client.go:568: [debug] Add/Modify event for gatekeeper-update-namespace-label: ADDED
client.go:607: [debug] gatekeeper-update-namespace-label: Jobs active: 0, jobs failed: 0, jobs succeeded: 0
client.go:568: [debug] Add/Modify event for gatekeeper-update-namespace-label: MODIFIED
client.go:607: [debug] gatekeeper-update-namespace-label: Jobs active: 1, jobs failed: 0, jobs succeeded: 0
Error: INSTALLATION FAILED: failed post-install: timed out waiting for the condition
helm.go:84: [debug] failed post-install: timed out waiting for the condition
INSTALLATION FAILED
main.newInstallCmd.func2
	helm.sh/helm/v3/cmd/helm/install.go:127
github.com/spf13/cobra.(*Command).execute
	github.com/spf13/[email protected]/command.go:856
github.com/spf13/cobra.(*Command).ExecuteC
	github.com/spf13/[email protected]/command.go:974
github.com/spf13/cobra.(*Command).Execute
	github.com/spf13/[email protected]/command.go:902
main.main
	helm.sh/helm/v3/cmd/helm/helm.go:83
runtime.main
	runtime/proc.go:250
runtime.goexit
	runtime/asm_amd64.s:1594

dwgillies-bluescape avatar Sep 27 '22 22:09 dwgillies-bluescape

We are installing Gatekeeper via Terraform Cloud. Terraform Cloud has a Helm provider with a 15-minute timeout. We are installing roughly 130 resources, including 50+ Helm charts. Once we perform a terraform plan, the 15-minute clock starts ticking, and we typically don't notice that the Terraform workspace is ready to apply until 1-2 minutes later. That means we effectively have 13 minutes to get 130 resources, including the 50 charts, installed. Gatekeeper taking 5+ minutes to install correctly is unreasonable in our environment. If adding extra time actually works, it's the slowest Helm chart we have ever seen. It hasn't been working for us for several months.

dwgillies-bluescape avatar Sep 27 '22 22:09 dwgillies-bluescape

@dwgillies-bluescape

It looks like you're deleting the namespace and trying to recreate it, leaving the ValidatingWebhookConfiguration (which points to a non-existent service due to the deleted namespace) in place.

This log line shows the failure calling the fail-closed webhook:

Error from server (InternalError): Internal error occurred: failed calling webhook "check-ignore-label.gatekeeper.sh": Post "https://gatekeeper-webhook-service.gatekeeper-system.svc:443/v1/admitlabel?timeout=3s": Address is not allowed

Deleting Gatekeeper's ValidatingWebhookConfiguration would eliminate the deadlock in your example (Helm would recreate it).
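
Concretely, that cleanup would look roughly like this (resource name taken from the chart defaults shown elsewhere in this thread):

kubectl get validatingwebhookconfigurations
kubectl delete validatingwebhookconfigurations gatekeeper-validating-webhook-configuration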

maxsmythe avatar Sep 28 '22 01:09 maxsmythe

When using Helm to install, please also use helm delete to remove ALL the resources deployed by this chart, so that resources like the ValidatingWebhookConfiguration aren't left behind. https://open-policy-agent.github.io/gatekeeper/website/docs/install#using-helm

Can you please also give v3.9.0 a try? https://github.com/open-policy-agent/gatekeeper/pull/2052 adds an initContainer to the gatekeeper-update-namespace-label job that checks for gatekeeper-webhook API availability, so the namespace-label container only runs after the webhook is confirmed to be reachable.

ritazh avatar Sep 28 '22 01:09 ritazh

In cases where the pod is present and bootstrapping, the delays are likely due to readiness. /readyz won't report true until all data is cached, constraint templates are observed, and whatever other bootstrapping is complete. If /readyz is returning false, then Kubernetes won't route traffic to the Service, which causes calls to that webhook to fail.
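
One quick way to confirm this state is to check whether the webhook Service has any ready endpoints, e.g. (names as in the default chart; an empty endpoints list means calls to the webhook will keep failing until /readyz flips to true):

kubectl -n gatekeeper-system get pods -l control-plane=controller-manager
kubectl -n gatekeeper-system get endpoints gatekeeper-webhook-service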

The namespace label check webhook technically does not have any bootstrapping dependencies (assuming a TLS cert is present), so it could be possible to host it on a separate pod (this config is not in the Helm chart), at the cost of some wasted resources.

Other solutions (assuming Rita's post, which showed up as I wrote this, doesn't work):

  • Create Gatekeeper's namespace outside of the Helm chart (including the appropriate ignore label). Unfortunately, it doesn't look like we can apply the ignore label in Helm until after G8r is installed, so this can't be fixed by changing the chart.

  • Set .Values.validatingWebhookCheckIgnoreFailurePolicy to Ignore (see the sketch after this list); note that this makes the ability to create/update namespaces a privileged operation that can bypass policy.

  • Limit G8r's dependency set so that bootstrapping completes in an acceptable timeframe (note that if the number of unique GroupVersions in your cluster is causing the client to throttle discovery API calls, the latency may not be fixable without an update on our end to disable/configure that throttling).
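
A sketch of the second option as a Helm override (value name taken from the chart's computed values shown later in this thread; keep the caveat above about namespace operations bypassing policy in mind):

helm install -n gatekeeper-system gatekeeper gatekeeper/gatekeeper \
  --create-namespace \
  --set validatingWebhookCheckIgnoreFailurePolicy=Ignore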

maxsmythe avatar Sep 28 '22 01:09 maxsmythe

We are running Istio and Cilium on a GovCloud Kubernetes cluster on AWS. Installation works on a similar cluster without Istio; the gatekeeper-update-namespace-label pod finishes in about 5-10 seconds in THAT cluster.

@maxsmythe I have added your suggested commands to delete the webhook configurations. @ritazh I have added your suggestion to delete the Helm release and use v3.9.0, and I also delete the namespace. The init container succeeds; it's just the openpolicyagent/gatekeeper-crds:v3.9.0 container that fails.

kubectl delete mutatingwebhookconfigurations gatekeeper-mutating-webhook-configuration
kubectl delete validatingwebhookconfigurations gatekeeper-validating-webhook-configuration
kubectl delete crd -l gatekeeper.sh/system=yes
helm delete gatekeeper -n gatekeeper-system
kubectl delete namespace gatekeeper-system
helm install -n gatekeeper-system --version v3.9.0  gatekeeper gatekeeper/gatekeeper --create-namespace --debug --timeout 900s

install.go:178: [debug] Original chart version: "v3.9.0"
install.go:195: [debug] CHART PATH: /Users/dongillies/Library/Caches/helm/repository/gatekeeper-3.9.0.tgz

client.go:128: [debug] creating 1 resource(s)
client.go:128: [debug] creating 1 resource(s)
client.go:128: [debug] creating 1 resource(s)
client.go:128: [debug] creating 1 resource(s)
client.go:128: [debug] creating 1 resource(s)
client.go:128: [debug] creating 1 resource(s)
client.go:128: [debug] creating 1 resource(s)
client.go:128: [debug] creating 1 resource(s)
client.go:128: [debug] creating 1 resource(s)
install.go:165: [debug] Clearing discovery cache
wait.go:48: [debug] beginning wait for 9 resources with timeout of 1m0s
W0927 19:31:41.315993   47872 warnings.go:70] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
client.go:128: [debug] creating 1 resource(s)
client.go:310: [debug] Starting delete for "gatekeeper-admin-upgrade-crds" ServiceAccount
client.go:339: [debug] serviceaccounts "gatekeeper-admin-upgrade-crds" not found
client.go:128: [debug] creating 1 resource(s)
client.go:310: [debug] Starting delete for "gatekeeper-admin-upgrade-crds" ClusterRole
client.go:339: [debug] clusterroles.rbac.authorization.k8s.io "gatekeeper-admin-upgrade-crds" not found
client.go:128: [debug] creating 1 resource(s)
client.go:310: [debug] Starting delete for "gatekeeper-admin-upgrade-crds" ClusterRoleBinding
client.go:339: [debug] clusterrolebindings.rbac.authorization.k8s.io "gatekeeper-admin-upgrade-crds" not found
client.go:128: [debug] creating 1 resource(s)
client.go:310: [debug] Starting delete for "gatekeeper-update-crds-hook" Job
client.go:339: [debug] jobs.batch "gatekeeper-update-crds-hook" not found
client.go:128: [debug] creating 1 resource(s)
client.go:540: [debug] Watching for changes to Job gatekeeper-update-crds-hook with timeout of 15m0s
client.go:568: [debug] Add/Modify event for gatekeeper-update-crds-hook: ADDED
client.go:607: [debug] gatekeeper-update-crds-hook: Jobs active: 0, jobs failed: 0, jobs succeeded: 0
client.go:568: [debug] Add/Modify event for gatekeeper-update-crds-hook: MODIFIED
client.go:607: [debug] gatekeeper-update-crds-hook: Jobs active: 1, jobs failed: 0, jobs succeeded: 0
client.go:568: [debug] Add/Modify event for gatekeeper-update-crds-hook: MODIFIED
client.go:310: [debug] Starting delete for "gatekeeper-admin-upgrade-crds" ServiceAccount
client.go:310: [debug] Starting delete for "gatekeeper-admin-upgrade-crds" ClusterRole
client.go:310: [debug] Starting delete for "gatekeeper-admin-upgrade-crds" ClusterRoleBinding
client.go:310: [debug] Starting delete for "gatekeeper-update-crds-hook" Job
client.go:128: [debug] creating 14 resource(s)
W0927 19:31:48.767738   47872 warnings.go:70] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
client.go:310: [debug] Starting delete for "gatekeeper-update-namespace-label" ServiceAccount
client.go:339: [debug] serviceaccounts "gatekeeper-update-namespace-label" not found
client.go:128: [debug] creating 1 resource(s)
client.go:310: [debug] Starting delete for "gatekeeper-update-namespace-label" Role
client.go:339: [debug] roles.rbac.authorization.k8s.io "gatekeeper-update-namespace-label" not found
client.go:128: [debug] creating 1 resource(s)
client.go:310: [debug] Starting delete for "gatekeeper-update-namespace-label" RoleBinding
client.go:339: [debug] rolebindings.rbac.authorization.k8s.io "gatekeeper-update-namespace-label" not found
client.go:128: [debug] creating 1 resource(s)
client.go:310: [debug] Starting delete for "gatekeeper-update-namespace-label" Job
client.go:339: [debug] jobs.batch "gatekeeper-update-namespace-label" not found
client.go:128: [debug] creating 1 resource(s)
client.go:540: [debug] Watching for changes to Job gatekeeper-update-namespace-label with timeout of 15m0s
client.go:568: [debug] Add/Modify event for gatekeeper-update-namespace-label: ADDED
client.go:607: [debug] gatekeeper-update-namespace-label: Jobs active: 0, jobs failed: 0, jobs succeeded: 0
client.go:568: [debug] Add/Modify event for gatekeeper-update-namespace-label: MODIFIED
client.go:607: [debug] gatekeeper-update-namespace-label: Jobs active: 1, jobs failed: 0, jobs succeeded: 0
Error: INSTALLATION FAILED: failed post-install: job failed: BackoffLimitExceeded
helm.go:84: [debug] failed post-install: job failed: BackoffLimitExceeded
INSTALLATION FAILED
main.newInstallCmd.func2
        helm.sh/helm/v3/cmd/helm/install.go:127
github.com/spf13/cobra.(*Command).execute
        github.com/spf13/[email protected]/command.go:856
github.com/spf13/cobra.(*Command).ExecuteC
        github.com/spf13/[email protected]/command.go:974
github.com/spf13/cobra.(*Command).Execute
        github.com/spf13/[email protected]/command.go:902
main.main
        helm.sh/helm/v3/cmd/helm/helm.go:83
runtime.main
        runtime/proc.go:250
runtime.goexit
        runtime/asm_amd64.s:1594

# gatekeeper-update-namespace-label still fails, after 5 attempts, in 6 mins

Error from server (InternalError): Internal error occurred: failed calling webhook "check-ignore-label.gatekeeper.sh": Post "https://gatekeeper-webhook-service.gatekeeper-system.svc:443/v1/admitlabel?timeout=3s": Address is not allowed
Error from server (InternalError): Internal error occurred: failed calling webhook "check-ignore-label.gatekeeper.sh": Post "https://gatekeeper-webhook-service.gatekeeper-system.svc:443/v1/admitlabel?timeout=3s": Address is not allowed
Error from server (InternalError): Internal error occurred: failed calling webhook "check-ignore-label.gatekeeper.sh": Post "https://gatekeeper-webhook-service.gatekeeper-system.svc:443/v1/admitlabel?timeout=3s": Address is not allowed
Error from server (InternalError): Internal error occurred: failed calling webhook "check-ignore-label.gatekeeper.sh": Post "https://gatekeeper-webhook-service.gatekeeper-system.svc:443/v1/admitlabel?timeout=3s": Address is not allowed
Error from server (InternalError): Internal error occurred: failed calling webhook "check-ignore-label.gatekeeper.sh": Post "https://gatekeeper-webhook-service.gatekeeper-system.svc:443/v1/admitlabel?timeout=3s": Address is not allowed
dongillies@bl-mbp16-a3041:[a-gstg1]~/repo/terraform-helm-k8s-generic$ helm list -A
NAME                                            NAMESPACE               REVISION        UPDATED                                 STATUS          CHART                                           APP VERSION
...
gatekeeper                                      gatekeeper-system       1               2022-09-27 19:10:51.567846 -0700 PDT    failed          gatekeeper-3.9.0                                v3.9.0     
$ kubectl get events --sort-by=.metadata.creationTimestamp -A
...
gatekeeper-system   10m         Normal    Scheduled                 pod/gatekeeper-update-namespace-label-zgkts           Successfully assigned gatekeeper-system/gatekeeper-update-namespace-label-zgkts to ip-10-64-124-100.us-gov-west-1.compute.internal
gatekeeper-system   10m         Normal    Pulled                    pod/gatekeeper-controller-manager-5447bbd765-js5cd    Container image "openpolicyagent/gatekeeper:v3.9.0" already present on machine
gatekeeper-system   10m         Normal    Pulled                    pod/gatekeeper-controller-manager-5447bbd765-dzx74    Container image "openpolicyagent/gatekeeper:v3.9.0" already present on machine
gatekeeper-system   10m         Normal    Created                   pod/gatekeeper-controller-manager-5447bbd765-js5cd    Created container manager
gatekeeper-system   10m         Normal    Started                   pod/gatekeeper-controller-manager-5447bbd765-js5cd    Started container manager
gatekeeper-system   10m         Normal    Started                   pod/gatekeeper-controller-manager-5447bbd765-dzx74    Started container manager
gatekeeper-system   10m         Warning   Unhealthy                 pod/gatekeeper-controller-manager-5447bbd765-js5cd    Readiness probe failed: Get "http://10.2.17.243:9090/readyz": dial tcp 10.2.17.243:9090: connect: connection refused
gatekeeper-system   9m46s       Normal    Pulled                    pod/gatekeeper-update-namespace-label-zgkts           Container image "curlimages/curl:7.83.1" already present on machine
gatekeeper-system   10m         Warning   Unhealthy                 pod/gatekeeper-controller-manager-5447bbd765-dzx74    Readiness probe failed: Get "http://10.2.29.96:9090/readyz": dial tcp 10.2.29.96:9090: connect: connection refused
gatekeeper-system   9m46s       Normal    Started                   pod/gatekeeper-update-namespace-label-zgkts           Started container webhook-probe-post
gatekeeper-system   9m46s       Normal    Created                   pod/gatekeeper-update-namespace-label-zgkts           Created container webhook-probe-post
gatekeeper-system   10m         Normal    Started                   pod/gatekeeper-audit-7df9d49f9c-f7g62                 Started container manager
gatekeeper-system   10m         Normal    Created                   pod/gatekeeper-audit-7df9d49f9c-f7g62                 Created container manager
gatekeeper-system   10m         Normal    Pulled                    pod/gatekeeper-audit-7df9d49f9c-f7g62                 Container image "openpolicyagent/gatekeeper:v3.9.0" already present on machine
gatekeeper-system   10m         Normal    Pulled                    pod/gatekeeper-controller-manager-5447bbd765-qnkjk    Container image "openpolicyagent/gatekeeper:v3.9.0" already present on machine
gatekeeper-system   10m         Normal    Created                   pod/gatekeeper-controller-manager-5447bbd765-qnkjk    Created container manager
gatekeeper-system   10m         Normal    Started                   pod/gatekeeper-controller-manager-5447bbd765-qnkjk    Started container manager
gatekeeper-system   10m         Warning   Unhealthy                 pod/gatekeeper-audit-7df9d49f9c-f7g62                 Readiness probe failed: Get "http://10.2.21.135:9090/readyz": dial tcp 10.2.21.135:9090: connect: connection refused
gatekeeper-system   10m         Warning   Unhealthy                 pod/gatekeeper-controller-manager-5447bbd765-qnkjk    Readiness probe failed: Get "http://10.2.21.12:9090/readyz": dial tcp 10.2.21.12:9090: connect: connection refused
gatekeeper-system   10m         Warning   BackOff                   pod/gatekeeper-update-namespace-label-zgkts           Back-off restarting failed container
gatekeeper-system   9m2s        Normal    Pulled                    pod/gatekeeper-update-namespace-label-zgkts           Container image "openpolicyagent/gatekeeper-crds:v3.9.0" already present on machine
gatekeeper-system   9m2s        Normal    Created                   pod/gatekeeper-update-namespace-label-zgkts           Created container kubectl-label
gatekeeper-system   9m2s        Normal    Started                   pod/gatekeeper-update-namespace-label-zgkts           Started container kubectl-label
gatekeeper-system   9m16s       Warning   BackOff                   pod/gatekeeper-update-namespace-label-zgkts           Back-off restarting failed container
gatekeeper-system   8m20s       Warning   BackoffLimitExceeded      job/gatekeeper-update-namespace-label                 Job has reached the specified backoff limit
gatekeeper-system   8m20s       Normal    SuccessfulDelete          job/gatekeeper-update-namespace-label                 Deleted pod: gatekeeper-update-namespace-label-zgkts
# right about here I probably did a helm upgrade --install of gatekeeper to set the status to "success"
gatekeeper-system   7m16s       Normal    Scheduled                 pod/gatekeeper-update-crds-hook-5gb9b                 Successfully assigned gatekeeper-system/gatekeeper-update-crds-hook-5gb9b to ip-10-64-124-100.us-gov-west-1.compute.internal
gatekeeper-system   7m16s       Normal    SuccessfulCreate          job/gatekeeper-update-crds-hook                       Created pod: gatekeeper-update-crds-hook-5gb9b
gatekeeper-system   7m14s       Normal    Pulled                    pod/gatekeeper-update-crds-hook-5gb9b                 Container image "openpolicyagent/gatekeeper-crds:v3.9.0" already present on machine
gatekeeper-system   7m14s       Normal    Created                   pod/gatekeeper-update-crds-hook-5gb9b                 Created container crds-upgrade
gatekeeper-system   7m14s       Normal    Started                   pod/gatekeeper-update-crds-hook-5gb9b                 Started container crds-upgrade
gatekeeper-system   7m12s       Normal    Completed                 job/gatekeeper-update-crds-hook                       Job completed

=======

Our Terraform and Helm charts used to work; they installed Gatekeeper successfully in February. There has been no change to this part of our configs since then, but now they are failing.

dwgillies-bluescape avatar Sep 28 '22 02:09 dwgillies-bluescape

The weird thing is that from the "failed" state I can flip the Helm Gatekeeper release status to "succeeded", but I think this is just a quirk of the chart's hook implementation, since the hook that breaks the installation is not re-run the second time. And we cannot run a helm upgrade from Terraform.

$ helm upgrade --install -n gatekeeper-system --version v3.9.0  gatekeeper gatekeeper/gatekeeper --create-namespace --debug --timeout 900s

history.go:56: [debug] getting history for release gatekeeper
upgrade.go:142: [debug] preparing upgrade for gatekeeper
upgrade.go:150: [debug] performing update for gatekeeper
upgrade.go:322: [debug] creating upgraded release for gatekeeper
client.go:310: [debug] Starting delete for "gatekeeper-admin-upgrade-crds" ServiceAccount
client.go:339: [debug] serviceaccounts "gatekeeper-admin-upgrade-crds" not found
client.go:128: [debug] creating 1 resource(s)
client.go:310: [debug] Starting delete for "gatekeeper-admin-upgrade-crds" ClusterRole
client.go:339: [debug] clusterroles.rbac.authorization.k8s.io "gatekeeper-admin-upgrade-crds" not found
client.go:128: [debug] creating 1 resource(s)
client.go:310: [debug] Starting delete for "gatekeeper-admin-upgrade-crds" ClusterRoleBinding
client.go:339: [debug] clusterrolebindings.rbac.authorization.k8s.io "gatekeeper-admin-upgrade-crds" not found
client.go:128: [debug] creating 1 resource(s)
client.go:310: [debug] Starting delete for "gatekeeper-update-crds-hook" Job
client.go:339: [debug] jobs.batch "gatekeeper-update-crds-hook" not found
client.go:128: [debug] creating 1 resource(s)
client.go:540: [debug] Watching for changes to Job gatekeeper-update-crds-hook with timeout of 15m0s
client.go:568: [debug] Add/Modify event for gatekeeper-update-crds-hook: ADDED
client.go:607: [debug] gatekeeper-update-crds-hook: Jobs active: 0, jobs failed: 0, jobs succeeded: 0
client.go:568: [debug] Add/Modify event for gatekeeper-update-crds-hook: MODIFIED
client.go:607: [debug] gatekeeper-update-crds-hook: Jobs active: 1, jobs failed: 0, jobs succeeded: 0
client.go:568: [debug] Add/Modify event for gatekeeper-update-crds-hook: MODIFIED
client.go:310: [debug] Starting delete for "gatekeeper-admin-upgrade-crds" ServiceAccount
client.go:310: [debug] Starting delete for "gatekeeper-admin-upgrade-crds" ClusterRole
client.go:310: [debug] Starting delete for "gatekeeper-admin-upgrade-crds" ClusterRoleBinding
client.go:310: [debug] Starting delete for "gatekeeper-update-crds-hook" Job
client.go:229: [debug] checking 14 resources for changes
client.go:521: [debug] Patch ResourceQuota "gatekeeper-critical-pods" in namespace gatekeeper-system
W0927 19:34:50.962344   47875 warnings.go:70] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
W0927 19:34:50.997090   47875 warnings.go:70] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
client.go:512: [debug] Looks like there are no changes for PodSecurityPolicy "gatekeeper-admin"
W0927 19:34:51.030471   47875 warnings.go:70] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
client.go:521: [debug] Patch PodDisruptionBudget "gatekeeper-controller-manager" in namespace gatekeeper-system
client.go:512: [debug] Looks like there are no changes for ServiceAccount "gatekeeper-admin"
client.go:512: [debug] Looks like there are no changes for Secret "gatekeeper-webhook-server-cert"
client.go:521: [debug] Patch ClusterRole "gatekeeper-manager-role" in namespace 
client.go:512: [debug] Looks like there are no changes for ClusterRoleBinding "gatekeeper-manager-rolebinding"
client.go:521: [debug] Patch Role "gatekeeper-manager-role" in namespace gatekeeper-system
client.go:512: [debug] Looks like there are no changes for RoleBinding "gatekeeper-manager-rolebinding"
client.go:512: [debug] Looks like there are no changes for Service "gatekeeper-webhook-service"
client.go:521: [debug] Patch Deployment "gatekeeper-audit" in namespace gatekeeper-system
client.go:521: [debug] Patch Deployment "gatekeeper-controller-manager" in namespace gatekeeper-system
client.go:521: [debug] Patch MutatingWebhookConfiguration "gatekeeper-mutating-webhook-configuration" in namespace 
client.go:521: [debug] Patch ValidatingWebhookConfiguration "gatekeeper-validating-webhook-configuration" in namespace 
upgrade.go:157: [debug] updating status for upgraded release for gatekeeper
Release "gatekeeper" has been upgraded. Happy Helming!
NAME: gatekeeper
LAST DEPLOYED: Tue Sep 27 19:34:40 2022
NAMESPACE: gatekeeper-system
STATUS: deployed
REVISION: 2
TEST SUITE: None
USER-SUPPLIED VALUES:
{}

COMPUTED VALUES:
audit:
  affinity: {}
  disableCertRotation: true
  dnsPolicy: ClusterFirst
  extraRules: []
  healthPort: 9090
  hostNetwork: false
  metricsPort: 8888
  nodeSelector:
    kubernetes.io/os: linux
  podSecurityContext:
    fsGroup: 999
    supplementalGroups:
    - 999
  priorityClassName: system-cluster-critical
  resources:
    limits:
      cpu: 1000m
      memory: 512Mi
    requests:
      cpu: 100m
      memory: 256Mi
  securityContext:
    allowPrivilegeEscalation: false
    capabilities:
      drop:
      - all
    readOnlyRootFilesystem: true
    runAsGroup: 999
    runAsNonRoot: true
    runAsUser: 1000
  tolerations: []
  writeToRAMDisk: false
auditChunkSize: 500
auditFromCache: false
auditInterval: 60
auditMatchKindOnly: false
constraintViolationsLimit: 20
controllerManager:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: gatekeeper.sh/operation
              operator: In
              values:
              - webhook
          topologyKey: kubernetes.io/hostname
        weight: 100
  disableCertRotation: false
  dnsPolicy: ClusterFirst
  exemptNamespacePrefixes: []
  exemptNamespaces: []
  extraRules: []
  healthPort: 9090
  hostNetwork: false
  metricsPort: 8888
  nodeSelector:
    kubernetes.io/os: linux
  podSecurityContext:
    fsGroup: 999
    supplementalGroups:
    - 999
  port: 8443
  priorityClassName: system-cluster-critical
  resources:
    limits:
      cpu: 1000m
      memory: 512Mi
    requests:
      cpu: 100m
      memory: 256Mi
  securityContext:
    allowPrivilegeEscalation: false
    capabilities:
      drop:
      - all
    readOnlyRootFilesystem: true
    runAsGroup: 999
    runAsNonRoot: true
    runAsUser: 1000
  tolerations: []
crds:
  resources: {}
  securityContext:
    allowPrivilegeEscalation: false
    capabilities:
      drop:
      - all
    readOnlyRootFilesystem: true
    runAsGroup: 65532
    runAsNonRoot: true
    runAsUser: 65532
disableMutation: false
disableValidatingWebhook: false
disabledBuiltins:
- '{http.send}'
emitAdmissionEvents: false
emitAuditEvents: false
enableDeleteOperations: false
enableExternalData: false
enableRuntimeDefaultSeccompProfile: true
enableTLSHealthcheck: false
image:
  crdRepository: openpolicyagent/gatekeeper-crds
  pullPolicy: IfNotPresent
  pullSecrets: []
  release: v3.9.0
  repository: openpolicyagent/gatekeeper
logDenies: false
logLevel: INFO
logMutations: false
metricsBackends:
- prometheus
mutatingWebhookCustomRules: {}
mutatingWebhookExemptNamespacesLabels: {}
mutatingWebhookFailurePolicy: Ignore
mutatingWebhookObjectSelector: {}
mutatingWebhookReinvocationPolicy: Never
mutatingWebhookTimeoutSeconds: 1
mutationAnnotations: false
pdb:
  controllerManager:
    minAvailable: 1
podAnnotations: {}
podCountLimit: 100
podLabels: {}
postInstall:
  labelNamespace:
    enabled: true
    extraNamespaces: []
    extraRules: []
    image:
      pullPolicy: IfNotPresent
      pullSecrets: []
      repository: openpolicyagent/gatekeeper-crds
      tag: v3.9.0
  probeWebhook:
    enabled: true
    httpTimeout: 2
    image:
      pullPolicy: IfNotPresent
      pullSecrets: []
      repository: curlimages/curl
      tag: 7.83.1
    insecureHTTPS: false
    waitTimeout: 60
  securityContext:
    allowPrivilegeEscalation: false
    capabilities:
      drop:
      - all
    readOnlyRootFilesystem: true
    runAsGroup: 999
    runAsNonRoot: true
    runAsUser: 1000
postUpgrade:
  labelNamespace:
    enabled: false
    extraNamespaces: []
    image:
      pullPolicy: IfNotPresent
      pullSecrets: []
      repository: openpolicyagent/gatekeeper-crds
      tag: v3.9.0
  securityContext:
    allowPrivilegeEscalation: false
    capabilities:
      drop:
      - all
    readOnlyRootFilesystem: true
    runAsGroup: 999
    runAsNonRoot: true
    runAsUser: 1000
preUninstall:
  deleteWebhookConfigurations:
    enabled: false
    extraRules: []
    image:
      pullPolicy: IfNotPresent
      pullSecrets: []
      repository: openpolicyagent/gatekeeper-crds
      tag: v3.9.0
  securityContext:
    allowPrivilegeEscalation: false
    capabilities:
      drop:
      - all
    readOnlyRootFilesystem: true
    runAsGroup: 999
    runAsNonRoot: true
    runAsUser: 1000
psp:
  enabled: true
rbac:
  create: true
replicas: 3
resourceQuota: true
secretAnnotations: {}
service: {}
upgradeCRDs:
  enabled: true
  extraRules: []
  tolerations: []
validatingWebhookCheckIgnoreFailurePolicy: Fail
validatingWebhookCustomRules: {}
validatingWebhookExemptNamespacesLabels: {}
validatingWebhookFailurePolicy: Ignore
validatingWebhookObjectSelector: {}
validatingWebhookTimeoutSeconds: 3

HOOKS:
---
# Source: gatekeeper/templates/namespace-post-install.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: gatekeeper-update-namespace-label
  labels:
    release: gatekeeper
    heritage: Helm
  annotations:
    "helm.sh/hook": post-install
    "helm.sh/hook-weight": "-5"
    "helm.sh/hook-delete-policy": hook-succeeded,before-hook-creation
---
# Source: gatekeeper/templates/upgrade-crds-hook.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    release: gatekeeper
    heritage: Helm
  name: gatekeeper-admin-upgrade-crds
  namespace: 'gatekeeper-system'
  annotations:
    helm.sh/hook: pre-install,pre-upgrade
    helm.sh/hook-delete-policy: "hook-succeeded,before-hook-creation"
    helm.sh/hook-weight: "1"
---
# Source: gatekeeper/templates/upgrade-crds-hook.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: gatekeeper-admin-upgrade-crds
  labels:
    release: gatekeeper
    heritage: Helm
  annotations:
    helm.sh/hook: pre-install,pre-upgrade
    helm.sh/hook-delete-policy: "hook-succeeded,before-hook-creation"
    helm.sh/hook-weight: "1"
rules:
  - apiGroups: ["apiextensions.k8s.io"]
    resources: ["customresourcedefinitions"]
    verbs: ["get", "create", "update", "patch"]
---
# Source: gatekeeper/templates/upgrade-crds-hook.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: gatekeeper-admin-upgrade-crds
  labels:
    release: gatekeeper
    heritage: Helm
  annotations:
    helm.sh/hook: pre-install,pre-upgrade
    helm.sh/hook-delete-policy: "hook-succeeded,before-hook-creation"
    helm.sh/hook-weight: "1"
subjects:
  - kind: ServiceAccount
    name: gatekeeper-admin-upgrade-crds
    namespace: gatekeeper-system
roleRef:
  kind: ClusterRole
  name: gatekeeper-admin-upgrade-crds
  apiGroup: rbac.authorization.k8s.io
---
# Source: gatekeeper/templates/namespace-post-install.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: gatekeeper-update-namespace-label
  labels:
    release: gatekeeper
    heritage: Helm
  annotations:
    "helm.sh/hook": post-install
    "helm.sh/hook-weight": "-5"
    "helm.sh/hook-delete-policy": hook-succeeded,before-hook-creation
rules:
  - apiGroups:
      - ""
    resources:
      - namespaces
    verbs:
      - get
      - update
      - patch
    resourceNames:
      - gatekeeper-system
---
# Source: gatekeeper/templates/namespace-post-install.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: gatekeeper-update-namespace-label
  labels:
    release: gatekeeper
    heritage: Helm
  annotations:
    "helm.sh/hook": post-install
    "helm.sh/hook-weight": "-5"
    "helm.sh/hook-delete-policy": hook-succeeded,before-hook-creation
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: gatekeeper-update-namespace-label
subjects:
  - kind: ServiceAccount
    name: gatekeeper-update-namespace-label
    namespace: "gatekeeper-system"
---
# Source: gatekeeper/templates/namespace-post-install.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: gatekeeper-update-namespace-label
  labels:
    app: 'gatekeeper'
    chart: 'gatekeeper'
    gatekeeper.sh/system: "yes"
    heritage: 'Helm'
    release: 'gatekeeper'
  annotations:
    "helm.sh/hook": post-install
    "helm.sh/hook-weight": "-5"
    "helm.sh/hook-delete-policy": hook-succeeded,before-hook-creation
spec:
  template:
    metadata:
      annotations:
        {}
      labels:
        app: 'gatekeeper'
        release: 'gatekeeper'
    spec:
      restartPolicy: OnFailure
      serviceAccount: gatekeeper-update-namespace-label
      nodeSelector:
        kubernetes.io/os: linux
      volumes:
        - name: cert
          secret:
            secretName: gatekeeper-webhook-server-cert
      initContainers:
        - name: webhook-probe-post
          image: "curlimages/curl:7.83.1"
          imagePullPolicy: IfNotPresent
          args:
            - "--retry"
            - "99999"
            - "--retry-max-time"
            - "60"
            - "--retry-delay"
            - "1"
            - "--max-time"
            - "2"
            - "--cacert"
            - /certs/ca.crt
            - "-v"
            - "https://gatekeeper-webhook-service.gatekeeper-system.svc/v1/admitlabel?timeout=2s"
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop:
              - all
            readOnlyRootFilesystem: true
            runAsGroup: 999
            runAsNonRoot: true
            runAsUser: 1000
          volumeMounts:
          - mountPath: /certs
            name: cert
            readOnly: true
      containers:
        - name: kubectl-label
          image: "openpolicyagent/gatekeeper-crds:v3.9.0"
          imagePullPolicy: IfNotPresent
          args:
            - label
            - ns
            - gatekeeper-system
            - admission.gatekeeper.sh/ignore=no-self-managing
            - --overwrite
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop:
              - all
            readOnlyRootFilesystem: true
            runAsGroup: 999
            runAsNonRoot: true
            runAsUser: 1000
---
# Source: gatekeeper/templates/upgrade-crds-hook.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: gatekeeper-update-crds-hook
  namespace: gatekeeper-system
  labels:
    app: gatekeeper
    chart: gatekeeper
    gatekeeper.sh/system: "yes"
    heritage: Helm
    release: gatekeeper
  annotations:
    helm.sh/hook: pre-install,pre-upgrade
    helm.sh/hook-weight: "1"
    helm.sh/hook-delete-policy: "hook-succeeded,before-hook-creation"
spec:
  backoffLimit: 0
  template:
    metadata:
      name: gatekeeper-update-crds-hook
      annotations:
        {}
    spec:
      serviceAccountName: gatekeeper-admin-upgrade-crds
      restartPolicy: Never
      containers:
      - name: crds-upgrade
        image: 'openpolicyagent/gatekeeper-crds:v3.9.0'
        imagePullPolicy: 'IfNotPresent'
        args:
        - apply
        - -f
        - crds/
        resources:
          {}
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - all
          readOnlyRootFilesystem: true
          runAsGroup: 65532
          runAsNonRoot: true
          runAsUser: 65532
      affinity:
        null
      nodeSelector:
        kubernetes.io/os: linux
      tolerations:
        []
MANIFEST:
---
# Source: gatekeeper/templates/gatekeeper-critical-pods-resourcequota.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  labels:
    app: 'gatekeeper'
    chart: 'gatekeeper'
    gatekeeper.sh/system: "yes"
    heritage: 'Helm'
    release: 'gatekeeper'
  name: gatekeeper-critical-pods
  namespace: 'gatekeeper-system'
spec:
  hard:
    pods: 100
  scopeSelector:
    matchExpressions:
    - operator: In
      scopeName: PriorityClass
      values:
      - system-cluster-critical
      - system-cluster-critical
---
# Source: gatekeeper/templates/gatekeeper-admin-podsecuritypolicy.yaml
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  annotations:
    seccomp.security.alpha.kubernetes.io/allowedProfileNames: '*'
  labels:
    app: 'gatekeeper'
    chart: 'gatekeeper'
    gatekeeper.sh/system: "yes"
    heritage: 'Helm'
    release: 'gatekeeper'
  name: gatekeeper-admin
spec:
  allowPrivilegeEscalation: false
  fsGroup:
    ranges:
    - max: 65535
      min: 1
    rule: MustRunAs
  requiredDropCapabilities:
  - ALL
  runAsUser:
    rule: MustRunAsNonRoot
  seLinux:
    rule: RunAsAny
  supplementalGroups:
    ranges:
    - max: 65535
      min: 1
    rule: MustRunAs
  volumes:
  - configMap
  - projected
  - secret
  - downwardAPI
  - emptyDir
---
# Source: gatekeeper/templates/gatekeeper-controller-manager-poddisruptionbudget.yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  labels:
    app: 'gatekeeper'
    chart: 'gatekeeper'
    gatekeeper.sh/system: "yes"
    heritage: 'Helm'
    release: 'gatekeeper'
  name: gatekeeper-controller-manager
  namespace: 'gatekeeper-system'
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: 'gatekeeper'
      chart: 'gatekeeper'
      control-plane: controller-manager
      gatekeeper.sh/operation: webhook
      gatekeeper.sh/system: "yes"
      heritage: 'Helm'
      release: 'gatekeeper'
---
# Source: gatekeeper/templates/gatekeeper-admin-serviceaccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    app: 'gatekeeper'
    chart: 'gatekeeper'
    gatekeeper.sh/system: "yes"
    heritage: 'Helm'
    release: 'gatekeeper'
  name: gatekeeper-admin
  namespace: 'gatekeeper-system'
---
# Source: gatekeeper/templates/gatekeeper-webhook-server-cert-secret.yaml
apiVersion: v1
kind: Secret
metadata:
  annotations:
    {}
  labels:
    app: 'gatekeeper'
    chart: 'gatekeeper'
    gatekeeper.sh/system: "yes"
    heritage: 'Helm'
    release: 'gatekeeper'
  name: gatekeeper-webhook-server-cert
  namespace: 'gatekeeper-system'
---
# Source: gatekeeper/templates/gatekeeper-manager-role-clusterrole.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  creationTimestamp: null
  labels:
    app: 'gatekeeper'
    chart: 'gatekeeper'
    gatekeeper.sh/system: "yes"
    heritage: 'Helm'
    release: 'gatekeeper'
  name: gatekeeper-manager-role
rules:
- apiGroups:
  - '*'
  resources:
  - '*'
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - admissionregistration.k8s.io
  resourceNames:
  - gatekeeper-mutating-webhook-configuration
  resources:
  - mutatingwebhookconfigurations
  verbs:
  - get
  - list
  - patch
  - update
  - watch
- apiGroups:
  - apiextensions.k8s.io
  resources:
  - customresourcedefinitions
  verbs:
  - create
  - delete
  - get
  - list
  - patch
  - update
  - watch
- apiGroups:
  - config.gatekeeper.sh
  resources:
  - configs
  verbs:
  - create
  - delete
  - get
  - list
  - patch
  - update
  - watch
- apiGroups:
  - config.gatekeeper.sh
  resources:
  - configs/status
  verbs:
  - get
  - patch
  - update
- apiGroups:
  - constraints.gatekeeper.sh
  resources:
  - '*'
  verbs:
  - create
  - delete
  - get
  - list
  - patch
  - update
  - watch
- apiGroups:
  - externaldata.gatekeeper.sh
  resources:
  - providers
  verbs:
  - create
  - delete
  - get
  - list
  - patch
  - update
  - watch
- apiGroups:
  - mutations.gatekeeper.sh
  resources:
  - '*'
  verbs:
  - create
  - delete
  - get
  - list
  - patch
  - update
  - watch
- apiGroups:
  - policy
  resourceNames:
  - gatekeeper-admin
  resources:
  - podsecuritypolicies
  verbs:
  - use
- apiGroups:
  - status.gatekeeper.sh
  resources:
  - '*'
  verbs:
  - create
  - delete
  - get
  - list
  - patch
  - update
  - watch
- apiGroups:
  - templates.gatekeeper.sh
  resources:
  - constrainttemplates
  verbs:
  - create
  - delete
  - get
  - list
  - patch
  - update
  - watch
- apiGroups:
  - templates.gatekeeper.sh
  resources:
  - constrainttemplates/finalizers
  verbs:
  - delete
  - get
  - patch
  - update
- apiGroups:
  - templates.gatekeeper.sh
  resources:
  - constrainttemplates/status
  verbs:
  - get
  - patch
  - update
- apiGroups:
  - admissionregistration.k8s.io
  resourceNames:
  - gatekeeper-validating-webhook-configuration
  resources:
  - validatingwebhookconfigurations
  verbs:
  - create
  - delete
  - get
  - list
  - patch
  - update
  - watch
---
# Source: gatekeeper/templates/gatekeeper-manager-rolebinding-clusterrolebinding.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  labels:
    app: 'gatekeeper'
    chart: 'gatekeeper'
    gatekeeper.sh/system: "yes"
    heritage: 'Helm'
    release: 'gatekeeper'
  name: gatekeeper-manager-rolebinding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: gatekeeper-manager-role
subjects:
- kind: ServiceAccount
  name: gatekeeper-admin
  namespace: 'gatekeeper-system'
---
# Source: gatekeeper/templates/gatekeeper-manager-role-role.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  creationTimestamp: null
  labels:
    app: 'gatekeeper'
    chart: 'gatekeeper'
    gatekeeper.sh/system: "yes"
    heritage: 'Helm'
    release: 'gatekeeper'
  name: gatekeeper-manager-role
  namespace: 'gatekeeper-system'
rules:
- apiGroups:
  - ""
  resources:
  - events
  verbs:
  - create
  - patch
- apiGroups:
  - ""
  resources:
  - secrets
  verbs:
  - create
  - delete
  - get
  - list
  - patch
  - update
  - watch
---
# Source: gatekeeper/templates/gatekeeper-manager-rolebinding-rolebinding.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  labels:
    app: 'gatekeeper'
    chart: 'gatekeeper'
    gatekeeper.sh/system: "yes"
    heritage: 'Helm'
    release: 'gatekeeper'
  name: gatekeeper-manager-rolebinding
  namespace: 'gatekeeper-system'
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: gatekeeper-manager-role
subjects:
- kind: ServiceAccount
  name: gatekeeper-admin
  namespace: 'gatekeeper-system'
---
# Source: gatekeeper/templates/gatekeeper-webhook-service-service.yaml
apiVersion: v1
kind: Service
metadata:
  labels:
    app: 'gatekeeper'
    chart: 'gatekeeper'
    gatekeeper.sh/system: "yes"
    heritage: 'Helm'
    release: 'gatekeeper'
  name: gatekeeper-webhook-service
  namespace: 'gatekeeper-system'
spec:
  
  ports:
  - name: https-webhook-server
    port: 443
    targetPort: webhook-server
  selector:
    app: 'gatekeeper'
    chart: 'gatekeeper'
    control-plane: controller-manager
    gatekeeper.sh/operation: webhook
    gatekeeper.sh/system: "yes"
    heritage: 'Helm'
    release: 'gatekeeper'
---
# Source: gatekeeper/templates/gatekeeper-audit-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: 'gatekeeper'
    chart: 'gatekeeper'
    control-plane: audit-controller
    gatekeeper.sh/operation: audit
    gatekeeper.sh/system: "yes"
    heritage: 'Helm'
    release: 'gatekeeper'
  name: gatekeeper-audit
  namespace: 'gatekeeper-system'
spec:
  replicas: 1
  selector:
    matchLabels:
      app: 'gatekeeper'
      chart: 'gatekeeper'
      control-plane: audit-controller
      gatekeeper.sh/operation: audit
      gatekeeper.sh/system: "yes"
      heritage: 'Helm'
      release: 'gatekeeper'
  template:
    metadata:
      annotations:
      labels:
        app: 'gatekeeper'
        chart: 'gatekeeper'
        control-plane: audit-controller
        gatekeeper.sh/operation: audit
        gatekeeper.sh/system: "yes"
        heritage: 'Helm'
        release: 'gatekeeper'
    spec:
      affinity:
        {}
      automountServiceAccountToken: true
      containers:
      -
        image: openpolicyagent/gatekeeper:v3.9.0
        args:
        - --audit-interval=60
        - --log-level=INFO
        - --constraint-violations-limit=20
        - --audit-from-cache=false
        - --audit-chunk-size=500
        - --audit-match-kind-only=false
        - --emit-audit-events=false
        - --operation=audit
        - --operation=status
        - --operation=mutation-status
        - --logtostderr
        - --health-addr=:9090
        - --prometheus-port=8888
        - --enable-external-data=false
        - --metrics-backend=prometheus
        - --disable-cert-rotation=true
        command:
        - /manager
        env:
        - name: POD_NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        - name: CONTAINER_NAME
          value: manager
        imagePullPolicy: 'IfNotPresent'
        livenessProbe:
          httpGet:
            path: /healthz
            port: 9090
        name: manager
        ports:
        - containerPort: 8888
          name: metrics
          protocol: TCP
        - containerPort: 9090
          name: healthz
          protocol: TCP
        readinessProbe:
          httpGet:
            path: /readyz
            port: 9090
        resources:
          limits:
            cpu: 1000m
            memory: 512Mi
          requests:
            cpu: 100m
            memory: 256Mi
        securityContext:
          seccompProfile:
            type: RuntimeDefault
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - all
          readOnlyRootFilesystem: true
          runAsGroup: 999
          runAsNonRoot: true
          runAsUser: 1000
        volumeMounts:
        - mountPath: /certs
          name: cert
          readOnly: true
        - mountPath: /tmp/audit
          name: tmp-volume
      dnsPolicy: ClusterFirst
      hostNetwork: false
      imagePullSecrets:
        []
      nodeSelector:
        kubernetes.io/os: linux
      priorityClassName:  system-cluster-critical
      securityContext:
        fsGroup: 999
        supplementalGroups:
        - 999
      serviceAccountName: gatekeeper-admin
      terminationGracePeriodSeconds: 60
      tolerations:
        []
      volumes:
      - name: cert
        secret:
          defaultMode: 420
          secretName: gatekeeper-webhook-server-cert
      - emptyDir: {}
        name: tmp-volume
---
# Source: gatekeeper/templates/gatekeeper-controller-manager-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: 'gatekeeper'
    chart: 'gatekeeper'
    control-plane: controller-manager
    gatekeeper.sh/operation: webhook
    gatekeeper.sh/system: "yes"
    heritage: 'Helm'
    release: 'gatekeeper'
  name: gatekeeper-controller-manager
  namespace: 'gatekeeper-system'
spec:
  replicas: 3
  selector:
    matchLabels:
      app: 'gatekeeper'
      chart: 'gatekeeper'
      control-plane: controller-manager
      gatekeeper.sh/operation: webhook
      gatekeeper.sh/system: "yes"
      heritage: 'Helm'
      release: 'gatekeeper'
  template:
    metadata:
      annotations:
      labels:
        app: 'gatekeeper'
        chart: 'gatekeeper'
        control-plane: controller-manager
        gatekeeper.sh/operation: webhook
        gatekeeper.sh/system: "yes"
        heritage: 'Helm'
        release: 'gatekeeper'
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: gatekeeper.sh/operation
                  operator: In
                  values:
                  - webhook
              topologyKey: kubernetes.io/hostname
            weight: 100
      automountServiceAccountToken: true
      containers:
      -
        image: openpolicyagent/gatekeeper:v3.9.0
        args:
        - --port=8443
        - --health-addr=:9090
        - --prometheus-port=8888
        - --logtostderr
        - --log-denies=false
        - --emit-admission-events=false
        - --log-level=INFO
        - --exempt-namespace=gatekeeper-system
        - --operation=webhook
        - --enable-external-data=false
        - --log-mutations=false
        - --mutation-annotations=false
        - --disable-cert-rotation=false
        - --metrics-backend=prometheus
        
        - --operation=mutation-webhook
        - --disable-opa-builtin={http.send}
        command:
        - /manager
        env:
        - name: POD_NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        - name: CONTAINER_NAME
          value: manager
        imagePullPolicy: 'IfNotPresent'
        livenessProbe:
          httpGet:
            path: /healthz
            port: 9090
        name: manager
        ports:
        - containerPort: 8443
          name: webhook-server
          protocol: TCP
        - containerPort: 8888
          name: metrics
          protocol: TCP
        - containerPort: 9090
          name: healthz
          protocol: TCP
        readinessProbe:
          httpGet:
            path: /readyz
            port: 9090
        resources:
          limits:
            cpu: 1000m
            memory: 512Mi
          requests:
            cpu: 100m
            memory: 256Mi
        securityContext:
          seccompProfile:
            type: RuntimeDefault
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - all
          readOnlyRootFilesystem: true
          runAsGroup: 999
          runAsNonRoot: true
          runAsUser: 1000
        volumeMounts:
        - mountPath: /certs
          name: cert
          readOnly: true
      dnsPolicy: ClusterFirst
      hostNetwork: false
      imagePullSecrets:
        []
      nodeSelector:
        kubernetes.io/os: linux
      priorityClassName:  system-cluster-critical
      securityContext:
        fsGroup: 999
        supplementalGroups:
        - 999
      serviceAccountName: gatekeeper-admin
      terminationGracePeriodSeconds: 60
      tolerations:
        []
      volumes:
      - name: cert
        secret:
          defaultMode: 420
          secretName: gatekeeper-webhook-server-cert
---
# Source: gatekeeper/templates/gatekeeper-mutating-webhook-configuration-mutatingwebhookconfiguration.yaml
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  labels:
    app: 'gatekeeper'
    chart: 'gatekeeper'
    gatekeeper.sh/system: "yes"
    heritage: 'Helm'
    release: 'gatekeeper'
  name: gatekeeper-mutating-webhook-configuration
webhooks:
- admissionReviewVersions:
  - v1
  - v1beta1
  clientConfig:
    service:
      name: gatekeeper-webhook-service
      namespace: 'gatekeeper-system'
      path: /v1/mutate
  failurePolicy: Ignore
  matchPolicy: Exact
  name: mutation.gatekeeper.sh
  namespaceSelector:
    matchExpressions:
    - key: admission.gatekeeper.sh/ignore
      operator: DoesNotExist
  objectSelector: {}
  reinvocationPolicy: Never
  rules:
  - apiGroups:
    - '*'
    apiVersions:
    - '*'
    operations:
    - CREATE
    - UPDATE
    resources:
    - '*'
  sideEffects: None
  timeoutSeconds: 1
---
# Source: gatekeeper/templates/gatekeeper-validating-webhook-configuration-validatingwebhookconfiguration.yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  labels:
    app: 'gatekeeper'
    chart: 'gatekeeper'
    gatekeeper.sh/system: "yes"
    heritage: 'Helm'
    release: 'gatekeeper'
  name: gatekeeper-validating-webhook-configuration
webhooks:
- admissionReviewVersions:
  - v1
  - v1beta1
  clientConfig:
    service:
      name: gatekeeper-webhook-service
      namespace: 'gatekeeper-system'
      path: /v1/admit
  failurePolicy: Ignore
  matchPolicy: Exact
  name: validation.gatekeeper.sh
  namespaceSelector:
    matchExpressions:
    - key: admission.gatekeeper.sh/ignore
      operator: DoesNotExist
  objectSelector: {}
  rules:
  - apiGroups:
    - '*'
    apiVersions:
    - '*'
    operations:
    - CREATE
    - UPDATE
    resources:
    - '*'
    # Explicitly list all known subresources except "status" (to avoid destabilizing the cluster and increasing load on gatekeeper).
    # You can find a rough list of subresources by doing a case-sensitive search in the Kubernetes codebase for 'Subresource("'
    - 'pods/ephemeralcontainers'
    - 'pods/exec'
    - 'pods/log'
    - 'pods/eviction'
    - 'pods/portforward'
    - 'pods/proxy'
    - 'pods/attach'
    - 'pods/binding'
    - 'deployments/scale'
    - 'replicasets/scale'
    - 'statefulsets/scale'
    - 'replicationcontrollers/scale'
    - 'services/proxy'
    - 'nodes/proxy'
    # For constraints that mitigate CVE-2020-8554
    - 'services/status'
  sideEffects: None
  timeoutSeconds: 3
- admissionReviewVersions:
  - v1
  - v1beta1
  clientConfig:
    service:
      name: gatekeeper-webhook-service
      namespace: 'gatekeeper-system'
      path: /v1/admitlabel
  failurePolicy: Fail
  matchPolicy: Exact
  name: check-ignore-label.gatekeeper.sh
  rules:
  - apiGroups:
    - ""
    apiVersions:
    - '*'
    operations:
    - CREATE
    - UPDATE
    resources:
    - namespaces
  sideEffects: None
  timeoutSeconds: 3

dwgillies-bluescape avatar Sep 28 '22 02:09 dwgillies-bluescape

Looks like this is the difference between the URL the init container probes and the URL in the failure message: the failure-message URL has an explicit :443 port and a 3s timeout instead of 2s.

init-container

  • https://gatekeeper-webhook-service.gatekeeper-system.svc/v1/admitlabel?timeout=2s

failure message

  • https://gatekeeper-webhook-service.gatekeeper-system.svc:443/v1/admitlabel?timeout=3s
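
To see whether that endpoint is reachable at all from inside the cluster, one quick check (a sketch; the pod name and curl image are arbitrary, the URL is the one from the failure message) is to curl it from a throwaway pod:

# run a one-off curl pod against the webhook URL from the failure message
kubectl -n gatekeeper-system run curl-test --rm -it --restart=Never --image=curlimages/curl -- \
  curl -vk https://gatekeeper-webhook-service.gatekeeper-system.svc:443/v1/admitlabel

A "connection refused" here means the webhook pods are not listening yet; any TLS/HTTP response at all means the service is up and the problem lies elsewhere (for example, the API server cannot reach it).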

dwgillies-bluescape avatar Sep 28 '22 03:09 dwgillies-bluescape

kubectl delete mutatingwebhookconfigurations gatekeeper-mutating-webhook-configuration
kubectl delete validatingwebhookconfigurations gatekeeper-validating-webhook-configuration
kubectl delete crd -l gatekeeper.sh/system=yes
helm delete gatekeeper -n gatekeeper-system
kubectl delete namespace gatekeeper-system
helm install -n gatekeeper-system --version v3.9.0 gatekeeper gatekeeper/gatekeeper --create-namespace --debug --timeout 900s

Running individual kubectl delete commands for all the gatekeeper resources can cause issues if they are not run in the right order, due to dependencies between them. You should use Helm to uninstall, which removes all of the gatekeeper resources on the cluster:

helm delete gatekeeper -n gatekeeper-system

For the failure message

https://gatekeeper-webhook-service.gatekeeper-system.svc:443/v1/admitlabel?timeout=3s

This means the gatekeeper webhook service is not yet ready to serve traffic from the API server, and the call times out after 3 seconds, as specified by timeoutSeconds in the gatekeeper-validating-webhook-configuration ValidatingWebhookConfiguration.
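
A quick way to confirm that (a sketch, assuming the default names from the chart manifests above) is to check whether the webhook service has ready endpoints and what timeout the check-ignore-label webhook is configured with:

# are the webhook pods Ready, and does the service have endpoints?
kubectl -n gatekeeper-system get pods -l control-plane=controller-manager
kubectl -n gatekeeper-system get endpoints gatekeeper-webhook-service

# configured timeout for the check-ignore-label webhook
kubectl get validatingwebhookconfiguration gatekeeper-validating-webhook-configuration \
  -o jsonpath='{.webhooks[?(@.name=="check-ignore-label.gatekeeper.sh")].timeoutSeconds}'

If the endpoints list is empty, the post-install label job cannot succeed no matter how large the timeout is.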

I'm having a hard time reproducing the issue on a kind cluster.

ritazh avatar Sep 28 '22 05:09 ritazh

Same issue here with v3.9.0. Increasing the timeout makes no difference, as the request fails instantly because the socket is not reachable ('connection refused'):

Internal error occurred: failed calling webhook "check-ignore-label.gatekeeper.sh": failed to call webhook: Post "https://gatekeeper-webhook-service.cattle-gatekeeper-system.svc:443/v1/admitlabel?timeout=30s": dial tcp 10.11.12.13:443: connect: connection refused
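
Until the chart handles this, one workaround is to skip the post-install job and label the namespace yourself. This is only a sketch: the postInstall.labelNamespace.enabled value and the no-self-managing label value are assumptions taken from the chart's values and templates, so double-check them against your chart version.

# install without the job that labels the namespace
helm install -n gatekeeper-system gatekeeper gatekeeper/gatekeeper \
  --create-namespace --set postInstall.labelNamespace.enabled=false

# then add the exemption label the job would have applied
kubectl label namespace gatekeeper-system admission.gatekeeper.sh/ignore=no-self-managing --overwrite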

mpepping avatar Nov 23 '22 14:11 mpepping

This should be fixed by #2385. It will be available as part of the Helm chart in the next release (v3.11), or you can test these changes today using the chart in the manifest_staging/charts folder. Please feel free to comment or re-open if the issue persists.
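
For anyone who wants to try it before the release, a minimal sketch (assuming the staged chart lives at manifest_staging/charts/gatekeeper in a checkout of the repo) would be:

git clone https://github.com/open-policy-agent/gatekeeper
helm install -n gatekeeper-system gatekeeper ./gatekeeper/manifest_staging/charts/gatekeeper --create-namespace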

sozercan avatar Nov 29 '22 22:11 sozercan