
webhook_request_latencies_bucket metric keeps adding new data series and becomes unusable

Open · r0bj opened this issue on Sep 6, 2020 · 34 comments

Expected Behavior

The Prometheus metric webhook_request_latencies_bucket is usable in a real environment and does not keep adding new data series forever. Prometheus is able to query the metric.

Actual Behavior

The Prometheus metric webhook_request_latencies_bucket creates so many data series that it becomes practically impossible to query in Prometheus (too much data). New series keep being added while the webhook is running, so the number of series grows forever. Restarting the tekton-pipelines-webhook pod resets the number of series and fixes the issue.

Steps to Reproduce the Problem

Run tekton-pipelines-webhook.
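
To confirm the growth, one can poll the series count over time. Here is a minimal sketch using the Prometheus Go client; the Prometheus address and the label names queried are assumptions (the labels come up later in this thread), so adjust them for your environment:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	// Assumed Prometheus address; point this at your own server.
	client, err := api.NewClient(api.Config{Address: "http://prometheus:9090"})
	if err != nil {
		panic(err)
	}
	promAPI := promv1.NewAPI(client)

	queries := []string{
		// Total series for the histogram; this keeps growing until the
		// tekton-pipelines-webhook pod is restarted.
		`count(webhook_request_latencies_bucket)`,
		// Distinct values per suspect label, to see which one dominates
		// the cardinality (label names assumed from the discussion below).
		`count(count by (resource_name) (webhook_request_latencies_bucket))`,
		`count(count by (resource_namespace) (webhook_request_latencies_bucket))`,
	}
	for _, q := range queries {
		ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
		result, warnings, err := promAPI.Query(ctx, q, time.Now())
		cancel()
		if err != nil {
			panic(err)
		}
		if len(warnings) > 0 {
			fmt.Println("warnings:", warnings)
		}
		fmt.Printf("%s => %v\n", q, result)
	}
}
```

If the first count climbs steadily while the webhook runs, the cardinality leak described above is confirmed.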

Additional Info

  • Kubernetes version:
Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.4", GitCommit:"c96aede7b5205121079932896c4ad89bb93260af", GitTreeState:"clean", BuildDate:"2020-06-17T11:41:22Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.4", GitCommit:"c96aede7b5205121079932896c4ad89bb93260af", GitTreeState:"clean", BuildDate:"2020-06-17T11:33:59Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}
  • Tekton Pipeline version:
Client version: 0.12.0
Pipeline version: v0.15.0
Triggers version: v0.7.0

r0bj avatar Sep 06 '20 05:09 r0bj

I can take a look at it if no one else is working on it, but it may take me a while.

ywluogg avatar Sep 10 '20 14:09 ywluogg

@ImJasonH will this be suitable as a good first issue?

ywluogg avatar Sep 15 '20 03:09 ywluogg

/assign ywluogg

cc @NavidZ since this relates to metrics

imjasonh avatar Sep 16 '20 14:09 imjasonh

Dropping this here for context. The webhook_request_latencies_bucket metric (and others) is heavily influenced by the labels in question here: https://github.com/knative/pkg/pull/1464/files

Removing the labels in that pull request might help reduce the number of unique webhook_request_latencies_bucket series the webhook has to manage. A simplified sketch of the mechanism is below.

Aside from this, I don't know if there's a way to configure the metrics code to purge metrics from the in-memory store after a period of time. This would help too. Most of the time, the in-memory data is sent to a backend like Prometheus, Stackdriver, etc. anyway.
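
To make the mechanism concrete, here is a simplified OpenCensus sketch (knative/pkg's actual view definitions differ, and the tag values here are made up): every distinct combination of tag values becomes a brand-new time series, multiplied by the histogram buckets once exported as webhook_request_latencies_bucket.

```go
package main

import (
	"context"
	"log"

	"go.opencensus.io/stats"
	"go.opencensus.io/stats/view"
	"go.opencensus.io/tag"
)

var (
	latency = stats.Float64("webhook/request_latency",
		"admission request latency", stats.UnitMilliseconds)

	resourceName, _      = tag.NewKey("resource_name")
	resourceNamespace, _ = tag.NewKey("resource_namespace")
)

func main() {
	// A distribution view keyed on high-cardinality tags: each unique
	// (resource_name, resource_namespace) pair is a new series, and it
	// is never removed from the in-memory store.
	if err := view.Register(&view.View{
		Measure:     latency,
		TagKeys:     []tag.Key{resourceName, resourceNamespace},
		Aggregation: view.Distribution(10, 50, 100, 500, 1000),
	}); err != nil {
		log.Fatal(err)
	}

	// Recording with generated, effectively unbounded tag values grows
	// the series set forever.
	ctx, err := tag.New(context.Background(),
		tag.Insert(resourceName, "pipelinerun-x7k2q"),
		tag.Insert(resourceNamespace, "ci-run-48151623"),
	)
	if err != nil {
		log.Fatal(err)
	}
	stats.Record(ctx, latency.M(42))
}
```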

eddie4941 avatar Sep 16 '20 19:09 eddie4941

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale with a justification. Stale issues rot after an additional 30d of inactivity and eventually close. If this issue is safe to close now please do so with /close with a justification. If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

tekton-robot avatar Dec 15 '20 21:12 tekton-robot

/remove-lifecycle stale

r0bj avatar Dec 15 '20 21:12 r0bj

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale with a justification. Stale issues rot after an additional 30d of inactivity and eventually close. If this issue is safe to close now please do so with /close with a justification. If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

tekton-robot avatar May 04 '21 03:05 tekton-robot

/remove-lifecycle stale

r0bj avatar May 04 '21 04:05 r0bj

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale with a justification. Stale issues rot after an additional 30d of inactivity and eventually close. If this issue is safe to close now please do so with /close with a justification. If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

tekton-robot avatar Oct 15 '21 00:10 tekton-robot

/remove-lifecycle stale

r0bj avatar Oct 15 '21 01:10 r0bj

@ywluogg are you still looking into this?

@vdemeester looks like this issue would be addressed by TEP-0073: Simplify metrics, right?

jerop avatar Nov 15 '21 20:11 jerop

Hi @jerop, I'm not looking into this anymore. Please unassign me. Thanks!

ywluogg avatar Nov 15 '21 21:11 ywluogg

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale with a justification. Stale issues rot after an additional 30d of inactivity and eventually close. If this issue is safe to close now please do so with /close with a justification. If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

tekton-robot avatar Feb 13 '22 21:02 tekton-robot

/remove-lifecycle stale

r0bj avatar Feb 13 '22 21:02 r0bj

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale with a justification. Stale issues rot after an additional 30d of inactivity and eventually close. If this issue is safe to close now please do so with /close with a justification. If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

tekton-robot avatar May 23 '22 19:05 tekton-robot

/assign @QuanZhang-William

QuanZhang-William avatar Jul 29 '22 20:07 QuanZhang-William

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale with a justification. Stale issues rot after an additional 30d of inactivity and eventually close. If this issue is safe to close now please do so with /close with a justification. If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

tekton-robot avatar Jan 04 '23 21:01 tekton-robot

/remove-lifecycle stale

r0bj avatar Jan 04 '23 21:01 r0bj

We have a very similar problem. Many metrics have a resource_namespace label. In our case, these namespaces have randomly generated names and live for a short time. This leads to very high cardinality for the resource_namespace label within about a week. That huge number of series results in growing memory consumption.

I agree with @eddie4941 that configuring the metrics code to purge metrics from the in-memory store after a period of time would help, roughly along the lines of the sketch below.
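
For illustration, a rough sketch of such a purge, assuming a plain client_golang HistogramVec (knative/pkg actually records through OpenCensus, so this shows the idea rather than a drop-in fix):

```go
package main

import (
	"sync"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// expiringHistogram wraps a HistogramVec and forgets label values that
// have not been observed within maxAge, bounding cardinality when the
// namespaces are short-lived.
type expiringHistogram struct {
	mu       sync.Mutex
	vec      *prometheus.HistogramVec
	lastSeen map[string]time.Time
	maxAge   time.Duration
}

func newExpiringHistogram(vec *prometheus.HistogramVec, maxAge time.Duration) *expiringHistogram {
	e := &expiringHistogram{
		vec:      vec,
		lastSeen: map[string]time.Time{},
		maxAge:   maxAge,
	}
	// Periodically drop series that have gone idle.
	go func() {
		for range time.Tick(maxAge) {
			e.purge()
		}
	}()
	return e
}

func (e *expiringHistogram) Observe(namespace string, v float64) {
	e.mu.Lock()
	e.lastSeen[namespace] = time.Now()
	e.mu.Unlock()
	e.vec.WithLabelValues(namespace).Observe(v)
}

func (e *expiringHistogram) purge() {
	e.mu.Lock()
	defer e.mu.Unlock()
	for ns, seen := range e.lastSeen {
		if time.Since(seen) > e.maxAge {
			e.vec.DeleteLabelValues(ns) // removes the series from the registry
			delete(e.lastSeen, ns)
		}
	}
}

func main() {
	vec := prometheus.NewHistogramVec(prometheus.HistogramOpts{
		Name: "webhook_request_latencies",
		Help: "request latency by namespace",
	}, []string{"resource_namespace"})
	prometheus.MustRegister(vec)

	h := newExpiringHistogram(vec, 10*time.Minute)
	h.Observe("ci-run-48151623", 42)
}
```

The tradeoff is that a purged series restarts from zero if the label value reappears, which is usually acceptable for short-lived namespaces.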

syurevich avatar May 25 '23 09:05 syurevich

Based on the discussion in the API WG: /assign @khrm

pritidesai avatar Jul 17 '23 17:07 pritidesai

@pritidesai: GitHub didn't allow me to assign the following users: khrm.

Note that only tektoncd members, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time. For more information please see the contributor guide

In response to this:

Based on the discussion in the API WG: /assign @khrm

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

tekton-robot avatar Jul 17 '23 17:07 tekton-robot

/assign @khrm

khrm avatar Jul 17 '23 18:07 khrm

@pritidesai This was fixed by https://github.com/knative/pkg/pull/1464

So we can close this.

/close

khrm avatar Jul 31 '23 16:07 khrm

@khrm: You can't close an active issue/PR unless you authored it or you are a collaborator.

In response to this:

@pritidesai This was fixed by https://github.com/knative/pkg/pull/1464

So we can close this.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

tekton-robot avatar Jul 31 '23 16:07 tekton-robot

@khrm not only the resource_name label but also the resource_namespace label can contribute to this high-cardinality issue. To fix it for every use case, one would need to purge metrics from the in-memory store after a period of time.

syurevich avatar Aug 01 '23 10:08 syurevich

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale with a justification. Stale issues rot after an additional 30d of inactivity and eventually close. If this issue is safe to close now please do so with /close with a justification. If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

tekton-robot avatar Oct 30 '23 10:10 tekton-robot

This issue is still relevant. See this comment as well as this.

syurevich avatar Nov 08 '23 08:11 syurevich

We have the same issue.

I have a proposal for knative/pkg at https://github.com/knative/pkg/pull/2931.

zhouhaibing089 avatar Jan 10 '24 20:01 zhouhaibing089

knative/pkg now gives the option to exclude arbitrary tags. I assume the next action item is to bump knative/pkg and customize the webhook options.
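
For whoever picks this up, the wiring would presumably look roughly like the sketch below. The webhook.Options fields shown exist in knative/pkg, but the tag-exclusion knob is a hypothetical placeholder; check what knative/pkg actually landed before copying it.

```go
package main

import (
	"context"

	"knative.dev/pkg/webhook"
)

func main() {
	// Attach webhook options to the context, as Tekton's webhook main does.
	ctx := webhook.WithOptions(context.Background(), webhook.Options{
		ServiceName: "tekton-pipelines-webhook",
		SecretName:  "webhook-certs",
		Port:        8443,
		// Hypothetical placeholder for the new tag-exclusion option;
		// the real field/function names may differ:
		// StatsReporterOptions: []webhook.StatsReporterOption{
		// 	webhook.ExcludeTags("resource_name", "resource_namespace"),
		// },
	})
	_ = ctx // pass ctx into the shared main setup as usual
}
```

Bumping the knative/pkg dependency in go.mod would come first, then threading the option through the webhook setup.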

zhouhaibing089 avatar Apr 02 '24 04:04 zhouhaibing089