
Gatekeeper high response time after adding --max-serving-threads and GOMAXPROCS

inboxamitraj opened this issue 3 years ago · 8 comments

What steps did you take and what happened: Gatekeeper was consuming a lot of memory (around 20 GB during peak hours) and the Gatekeeper controller manager pods were OOMing. To fix this, we applied the fix mentioned in https://github.com/open-policy-agent/gatekeeper/issues/1907:

  • added the --max-serving-threads flag, set to the max CPU count (3 in our case)
  • added GOMAXPROCS, also set to the max CPU count (3 in our case); see the sketch below
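
For reference, a minimal sketch of where these settings live in the gatekeeper-controller-manager Deployment (the container name and the values shown are assumptions; adjust them to your own manifests):

spec:
  template:
    spec:
      containers:
        - name: manager                   # container name assumed; check your Deployment
          args:
            - --max-serving-threads=3     # cap concurrent webhook-serving goroutines
          env:
            - name: GOMAXPROCS            # cap Go runtime parallelism to the CPU limit
              value: "3"
          resources:
            limits:
              cpu: "3"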

This indeed brought memory consumption down, and there were no more OOM restarts.
[screenshot: memory usage under control]

However, we observed that the webhook response time [gatekeeper_validation_request_duration_seconds] went up sharply, to around 3 seconds, compared to around 50 ms before this fix.

[screenshot: webhook response time spiked after the fix]

What did you expect to happen: memory/CPU under control without compromising webhook response time.

Anything else you would like to add: We have around 74 constraints and 32 templates.

Environment:

  • Gatekeeper version: 3.8.1
  • Kubernetes version: 1.20

inboxamitraj commented Sep 07 '22

What kind of QPS is G8r getting?

My first instincts are:

  • The footprint needs to be larger (more pods and/or larger pods)
  • Further tuning might be able to squeeze more out of the same footprint by raising max-serving-threads and GOMAXPROCS above the CPU limit. I'd try raising max-serving-threads first.

Do you have any constraints that use blocking calls (e.g. anything that uses http.send() or external data)?

maxsmythe commented Sep 08 '22

Hi @maxsmythe, thank you for your reply. Sorry, I am not sure how to get QPS for G8r. I looked through various Prometheus metrics but unfortunately could not find one for QPS.

  • Yes, the footprint is large (we run one cron job that creates around 30 pods) when we see high response times from G8r.
  • I tried increasing max-serving-threads to 7; this did not help.
  • I tried increasing GOMAXPROCS to 7; this also did not help.
  • My current configuration is: max CPU=5, max-serving-threads=7, GOMAXPROCS=7.

Unfortunately, this configuration did not help.

One possible workaround for now is to disable the cron job, but this will only fix the problem temporarily. We expect the workload to increase in the future, and we would like to scale G8r for the larger footprint.

inboxamitraj commented Sep 21 '22

The QPS would be inferred by analyzing the gatekeeper_validation_request_count metric (calculating delta and dividing by the sampling interval). It looks like you have a dashboard system, does it allow graphing those kinds of expressions?

Here is a stack overflow answer that seems on-point:

https://stackoverflow.com/questions/64043624/how-prometheus-statistics-qps-in-multi-process

Knowing this number would help to figure out if the QPS served/pod seems reasonable.
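
As a rough sketch (assuming the metric name above is what your Prometheus scrape actually exposes; the 5m window and the recording-rule name are arbitrary), a rule along these lines would let you graph QPS directly:

groups:
  - name: gatekeeper-webhook
    rules:
      - record: gatekeeper:validation_request_qps:rate5m
        # per-second rate of admission requests, summed across webhook pods
        expr: sum(rate(gatekeeper_validation_request_count[5m]))

Dividing that by the number of webhook pods gives the served-per-pod number I'm interested in.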

You can definitely scale Gatekeeper by adding pods, this section of the doc talks about the implications of that: https://open-policy-agent.github.io/gatekeeper/website/docs/failing-closed#impact-of-scaling-nodes

you mention you have 32 templates... are any of them referential (i.e. do they have data.inventory in their Rego source code)?

maxsmythe commented Sep 22 '22

Thank you @maxsmythe

  • QPS is around 17 during peak hours and around 10 during normal hours. [screenshot: QPS dashboard, 2022-09-26]

  • Yes, we are using data.inventory in our Rego sources; around 10-15 templates use it.

  • I am attaching other Gatekeeper monitoring metrics as well: Gatekeeper-grafana-1.pdf, Gatekeeper-grafana-2.pdf, Gatekeeper-grafana-3.pdf

  • I also noticed that the gatekeeper controller pods are flooded with this error:

{"level":"error","ts":1664181904.8118215,"logger":"webhook","msg":"error executing query","hookType":"validation","error":"serving context canceled, aborting request","stacktrace":"sigs.k8s.io/controller-runtime/pkg/webhook/admission.(*Webhook).Handle\n\t/go/src/github.com/open-policy-agent/gatekeeper/vendor/sigs.k8s.io/controller-runtime/pkg/webhook/admission/webhook.go:146\nsigs.k8s.io/controller-runtime/pkg/webhook/admission.(*Webhook).ServeHTTP\n\t/go/src/github.com/open-policy-agent/gatekeeper/vendor/sigs.k8s.io/controller-runtime/pkg/webhook/admission/http.go:99\ngithub.com/prometheus/client_golang/prometheus/promhttp.InstrumentHandlerInFlight.func1\n\t/go/src/github.com/open-policy-agent/gatekeeper/vendor/github.com/prometheus/client_golang/prometheus/promhttp/instrument_server.go:40\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2047\ngithub.com/prometheus/client_golang/prometheus/promhttp.InstrumentHandlerCounter.func1\n\t/go/src/github.com/open-policy-agent/gatekeeper/vendor/github.com/prometheus/client_golang/prometheus/promhttp/instrument_server.go:117\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2047\ngithub.com/prometheus/client_golang/prometheus/promhttp.InstrumentHandlerDuration.func2\n\t/go/src/github.com/open-policy-agent/gatekeeper/vendor/github.com/prometheus/client_golang/prometheus/promhttp/instrument_server.go:84\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2047\nnet/http.(*ServeMux).ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2425\nnet/http.serverHandler.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2879\nnet/http.(*conn).serve\n\t/usr/local/go/src/net/http/server.go:1930"}

inboxamitraj commented Sep 26 '22

As an experiment (if you are able to do this), does disabling the sync config fix the issue?

I'm wondering if it's the lock contention issue addressed here:

https://github.com/open-policy-agent/gatekeeper/pull/2272
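
For clarity, by "the sync config" I mean the data-replication Config object, roughly like the sketch below (the kinds under syncOnly are illustrative; yours will be whatever you currently replicate for data.inventory). Disabling sync for the experiment would mean removing or emptying spec.sync:

apiVersion: config.gatekeeper.sh/v1alpha1
kind: Config
metadata:
  name: config
  namespace: gatekeeper-system
spec:
  sync:
    syncOnly:
      # illustrative entries only
      - group: ""
        version: "v1"
        kind: "Namespace"
      - group: ""
        version: "v1"
        kind: "Pod"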

maxsmythe commented Sep 27 '22

Hi @maxsmythe, unfortunately disabling the sync config is not an option for us. What we are thinking now is to focus on optimising our cron jobs and see if that improves performance, because we only see high latency from G8r during cron job runs.

Thank you

inboxamitraj commented Oct 04 '22

One intriguing thing I noticed is that the Median Request Duration latency (gatekeeper_validation_request_duration_seconds_bucket) never goes beyond 3 seconds. [screenshot: median request duration, 2022-10-04]

Is this because we have set timeoutSeconds: 30?

apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: gatekeeper-validating-webhook-configuration
webhooks:
- failurePolicy: Fail
  name: validation.gatekeeper.sh
  timeoutSeconds: 30
  rules:
  - apiGroups:
    - '*'
    apiVersions:
    - '*'
    operations:
    - CREATE
    - UPDATE
    - DELETE
    resources:
    - '*'

inboxamitraj commented Oct 04 '22

3 seconds is the current maximum latency the distribution is configured to observe:

https://github.com/open-policy-agent/gatekeeper/blob/ab939829c8296634149eb68cb030d9fc8b6d2a21/pkg/webhook/stats_reporter.go#L86-L92
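
Anything slower than 3 seconds is counted in the histogram's +Inf bucket, and Prometheus's histogram_quantile() never reports a value above the largest finite bucket bound, so a median (or any percentile) computed from gatekeeper_validation_request_duration_seconds_bucket will flat-line at 3 seconds; it isn't related to timeoutSeconds: 30. A sketch of the kind of query that exhibits that cap (window and recording-rule name are illustrative):

groups:
  - name: gatekeeper-webhook-latency
    rules:
      - record: gatekeeper:validation_request_duration_seconds:p50
        # median webhook latency; cannot exceed the largest finite bucket (3s)
        expr: histogram_quantile(0.5, sum(rate(gatekeeper_validation_request_duration_seconds_bucket[5m])) by (le))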

If #2272 isn't fixing the performance issues, I also have hopes that https://github.com/open-policy-agent/frameworks/pull/252 will help.

maxsmythe commented Oct 05 '22

Thank you for your inputs @maxsmythe, highly appreciated. We are currently fine-tuning our cron jobs. It seems this fix was just released in 3.9.1; we will wait for an n-1 release before we upgrade in production.

inboxamitraj commented Oct 12 '22

The change that avoids copying data.inventory is fixed in 3.9.2.

maxsmythe commented Oct 13 '22

Thank you for pointing that out, much appreciated.

inboxamitraj commented Oct 13 '22

I see you re-opened the issue, is there anything outstanding that you need?

maxsmythe commented Oct 27 '22

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

stale[bot] commented Dec 26 '22