
request.go:668] Waited for *s due to client-side throttling, not priority and fairness, request: GET:https://172.20.0.1:443/apis/triggers.tekton.dev/v1alpha1/clusterinterceptors/github

Open qtheya opened this issue 2 years ago • 21 comments

Expected Behavior

Event listener is working as expected

Actual Behavior

Drastic performance degradation:

1 request.go:668] Waited for 11.393899312s due to client-side throttling, not priority and fairness, request: GET:https://172.20.0.1:443/apis/triggers.tekton.dev/v1alpha1/clusterinterceptors/github
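For background, the "client-side throttling" message is logged by client-go's own rate limiter, which defaults to roughly 5 requests per second with a burst of 10; when a flood of GETs exceeds that budget, the extra requests queue up and each one logs a "Waited for ..." line. A minimal sketch of where those limits live, assuming a plain rest.Config (raising them only masks the symptom; the discussion below is about removing the per-event GETs):

package main

import (
	"fmt"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// In-cluster config, as an EventListener pod would use.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	// client-go applies defaults of QPS=5 and Burst=10 when these are left unset.
	// Raising them (the values below are arbitrary) reduces the queueing, but the
	// underlying issue is the volume of live GETs per incoming event.
	cfg.QPS = 50
	cfg.Burst = 100
	client := kubernetes.NewForConfigOrDie(cfg)
	fmt.Println("client ready:", client != nil)
}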

Steps to Reproduce the Problem

  1. Install Operator > 0.51.1

Additional Info

  • Kubernetes version:

    Output of kubectl version:

Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.4", GitCommit:"b695d79d4f967c403a96986f1750a35eb75e75f1", GitTreeState:"clean", BuildDate:"2021-11-17T15:48:33Z", GoVersion:"go1.16.10", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"21+", GitVersion:"v1.21.2-eks-06eac09", GitCommit:"5f6d83fe4cb7febb5f4f4e39b3b2b64ebbbe3e97", GitTreeState:"clean", BuildDate:"2021-09-13T14:20:15Z", GoVersion:"go1.16.5", Compiler:"gc", Platform:"linux/amd64"}
  • Tekton Pipeline version:

    v0.29.0

As far as I can see, after tekton-triggers was moved to knative packages (v0.17.x+), performance degraded. All webhooks from GitHub time out. Pipelines still run, but with a start delay of ~2 minutes. This is the time the EventListener spends parsing the webhook and trying to match the CEL condition. In the logs it looks very slow:

el-github-listener-interceptor-6b86d6f674-dp5hh event-listener {"level":"info","ts":"2021-12-09T09:57:20.094Z","logger":"eventlistener","caller":"sink/sink.go:301","msg":"... type push is not allowed","eventlistener":"github-listener-interceptor"...
el-github-listener-interceptor-6b86d6f674-dp5hh event-listener {"level":"info","ts":"2021-12-09T09:57:20.293Z","logger":"eventlistener","caller":"sink/sink.go:301","msg":"... type push is not allowed","eventlistener":"github-listener-interceptor"...
el-github-listener-interceptor-6b86d6f674-dp5hh event-listener {"level":"info","ts":"2021-12-09T09:57:20.694Z","logger":"eventlistener","caller":"sink/sink.go:301","msg":"... type push is not allowed","eventlistener":"github-listener-interceptor"...
el-github-listener-interceptor-6b86d6f674-dp5hh event-listener {"level":"info","ts":"2021-12-09T09:57:21.094Z","logger":"eventlistener","caller":"sink/sink.go:301","msg":"... type push is not allowed","eventlistener":"github-listener-interceptor"...
el-github-listener-interceptor-6b86d6f674-dp5hh event-listener {"level":"info","ts":"2021-12-09T09:57:21.893Z","logger":"eventlistener","caller":"sink/sink.go:301","msg":"... type push is not allowed","eventlistener":"github-listener-interceptor"...
el-github-listener-interceptor-6b86d6f674-dp5hh event-listener {"level":"info","ts":"2021-12-09T09:57:22.294Z","logger":"eventlistener","caller":"sink/sink.go:301","msg":"... type push is not allowed","eventlistener":"github-listener-interceptor"...
el-github-listener-interceptor-6b86d6f674-dp5hh event-listener {"level":"info","ts":"2021-12-09T09:57:22.894Z","logger":"eventlistener","caller":"sink/sink.go:301","msg":"... type push is not allowed","eventlistener":"github-listener-interceptor"...

As you can see, it parses 2-3 conditions per second, whereas prior versions parsed hundreds in the same time. I experimented today and installed different versions of tekton-operator from 0.50.0 through 0.53.0. All versions with Triggers 0.17.x are very slow.
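For reference, the CEL matching mentioned above is expression evaluation built on cel-go, and the evaluation itself is cheap; later comments in this thread point at the API lookups around each interceptor call rather than the expression. A minimal, hypothetical cel-go sketch of an event-type check of this sort (not the Triggers interceptor code itself):

package main

import (
	"fmt"

	"github.com/google/cel-go/cel"
	"github.com/google/cel-go/checker/decls"
)

func main() {
	// Declare the variable the expression is allowed to reference.
	env, err := cel.NewEnv(cel.Declarations(decls.NewVar("eventType", decls.String)))
	if err != nil {
		panic(err)
	}
	// Compile a condition of the kind a GitHub trigger filter might express.
	ast, iss := env.Compile(`eventType == 'pull_request'`)
	if iss != nil && iss.Err() != nil {
		panic(iss.Err())
	}
	prg, err := env.Program(ast)
	if err != nil {
		panic(err)
	}
	// Evaluate against a "push" event: the filter does not match, analogous to
	// the "... type push is not allowed" lines in the logs above.
	out, _, err := prg.Eval(map[string]interface{}{"eventType": "push"})
	if err != nil {
		panic(err)
	}
	fmt.Println("matched:", out.Value()) // matched: false
}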

qtheya avatar Dec 09 '21 10:12 qtheya

@savitaashture @khrm @vdemeester

sm43 avatar Dec 09 '21 12:12 sm43

I think this should be a tektoncd/triggers issue though, I can move it there. /cc @dibyom

vdemeester avatar Dec 09 '21 12:12 vdemeester

Just to be clear - this was not an issue with Triggers v0.16?

dibyom avatar Dec 09 '21 16:12 dibyom

Just to be clear - this was not an issue with Triggers v0.16?

Correct

qtheya avatar Dec 10 '21 04:12 qtheya

Pipelines run anyway but with start delay ~ 2 minutes. This is time when EventListener is parsing webhook and is trying to match CEL condition. In logs it looks very slow

Hmm I wonder if we are seeing similar behavior in our own dogfooding cluster where we are on Triggers 0.17.1

dibyom avatar Dec 10 '21 21:12 dibyom

Sorry for the delay. I checked our logs from the dogfooding cluster and I see many interceptor log messages roughly a millisecond apart. Based on that, I do not think this is an issue with Triggers in general. Could you confirm that your cluster was not overloaded around this time?

dibyom avatar Jan 07 '22 22:01 dibyom

Hello. Since I opened this issue, my team has done two fresh installs of the latest tekton-operator version into different fresh clusters in different AWS accounts. In both cases we saw the same behavior, so we downgraded to 0.50.2 and it just works. In both cases we used the latest AWS EKS v1.21.x, with node pools provisioned exclusively for Tekton: instance_types = ["c5d.xlarge", "c5d.2xlarge", "m5d.xlarge", "m5d.2xlarge"], asg_max_size = 10. In the first case we used the Traefik ingress and Argo CD 2.1.7 to install Tekton; in the second, the Istio ingress (sidecars disabled) and Argo CD 2.2.1.

When I tested the different versions, there were about 3 minutes between installations:

  1. kubectl apply 0.50.2
  2. wait until all pods are up and running
  3. send a GitHub webhook
  4. works ...
  5. kubectl apply > 0.50.2
  6. wait until all pods are up and running
  7. send a GitHub webhook
  8. problem ...
  9. kubectl apply 0.50.2
  10. wait until all pods are up and running
  11. send a GitHub webhook
  12. works ...

I can provide additional information; just let me know what you need. Thank you

qtheya avatar Jan 08 '22 05:01 qtheya

This is also happening to me. I downgraded from 0.18 to 0.16 and the problem seemed to disappear, although in my case not all GitHub requests were failing, only some. Triggers: 0.18.0, Pipelines: 0.32.1

One warning I get from the listener pods is Sink timeout configuration is invalid, default to -1 (no timeout); not sure if this is related to the issue.

I was thinking that if the eventlistener calls the API server for each trigger on each request to fetch the interceptor it might cause this kind of issue? Would this be a problem?

joaosilva15 avatar Feb 08 '22 18:02 joaosilva15

I was thinking that if the eventlistener calls the API server for each trigger on each request to fetch the interceptor it might cause this kind of issue? Would this be a problem?

That's a good guess but even in v0.16 we were making the same requests.

I wonder if this has something to do with a newer knative/pkg version

Maybe we are running into https://github.com/knative/pkg/issues/2383 and we need to port https://github.com/tektoncd/pipeline/pull/4541 to Triggers?

@joaosilva15 @qtheya could you try adding a config-leader-election.yaml file like the one below and see if that fixes your issue?

https://github.com/tektoncd/pipeline/blob/8f568ca4af50d6c18be1d19926689b9477463271/config/config-leader-election.yaml

dibyom avatar Feb 08 '22 20:02 dibyom

@dibyom fyi: because I'm using the tekton operator, I set the parameters on the TektonInstallerSet and restarted (deleted) the pods to apply the values to the ConfigMap and containers

apiVersion: v1
data:
  leaseDuration: 60s
  renewDeadline: 40s
  retryPeriod: 10s
kind: ConfigMap
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
....
  creationTimestamp: "2022-02-09T04:50:18Z"
  labels:
    app.kubernetes.io/instance: default
    app.kubernetes.io/part-of: tekton-pipelines
  name: config-leader-election
  namespace: tekton-pipelines
  ownerReferences:
  - apiVersion: operator.tekton.dev/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: TektonInstallerSet
    name: pipeline-874b5

The suggested solution doesn't help.

event-listener {"level":"info","ts":"2022-02-09T05:35:36.627Z","logger":"eventlistener","caller":"sink/sink.go:329","msg":"interceptor stopped trigger processing: rpc error: code = FailedPrecondition desc = event type status is not allowed","eventlistener":"github-listener-interceptor","namespace":"tekton-pipelines","eventlistenerUID":"...","/triggers-eventid":"...","/trigger":"****-push-dev"}
event-listener I0209 05:35:36.822562       1 request.go:668] Waited for 21.591784626s due to client-side throttling, not priority and fairness, request: GET:https://172.20.0.1:443/apis/triggers.tekton.dev/v1alpha1/clusterinterceptors/github
event-listener {"level":"info","ts":"2022-02-09T05:35:36.827Z","logger":"eventlistener","caller":"sink/sink.go:329","msg":"interceptor stopped trigger processing: rpc error: code = FailedPrecondition desc = event type status is not allowed","eventlistener":"github-listener-interceptor","namespace":"tekton-pipelines","eventlistenerUID":"...","/triggers-eventid":"...","/trigger":"****-push-dev"}
event-listener {"level":"info","ts":"2022-02-09T05:35:37.027Z","logger":"eventlistener","caller":"sink/sink.go:329","msg":"interceptor stopped trigger processing: rpc error: code = FailedPrecondition desc = event type status is not allowed","eventlistener":"github-listener-interceptor","namespace":"tekton-pipelines","eventlistenerUID":"...","/triggers-eventid":"...","/trigger":"****-push-dev"}

qtheya avatar Feb 09 '22 05:02 qtheya

After changing the config map to that I still get the same issue. Can you explain why this would help this situation? Is that config map used by both pipelines controller and the triggers controller?

joaosilva15 avatar Feb 09 '22 10:02 joaosilva15

Can you explain why this would help this situation? Is that config map used by both pipelines controller and the triggers controller?

Sure - the config is used by the knative/pkg package that we use to build our controllers, to configure leader election. It had a default resync period of 2 seconds, which was causing a lot of calls to the API server, leading to similar client-side throttling issues.

Looks like this might be a different issue, though. @khrm has https://github.com/tektoncd/triggers/pull/1308/files to switch to a non-default HTTP client, which might help here.

dibyom avatar Feb 09 '22 21:02 dibyom

Hey @dibyom, I forgot about this and left Triggers at version 0.16 for a while. I tried upgrading to 0.20.1; it is clearly better than 0.18.0, but we still get timeouts due to throttling while getting clusterinterceptors:

Waited for 11.592004675s due to client-side throttling, not priority and fairness, request: GET:https://172.20.0.1:443/apis/triggers.tekton.dev/v1alpha1/clusterinterceptors/cel

I believe this is the same issue as https://github.com/tektoncd/triggers/issues/1374

joaosilva15 avatar Jul 04 '22 11:07 joaosilva15

@joaosilva15 yes it seems like there are some spots where we are calling the API server directly instead of going through the lister cache.

dibyom avatar Jul 07 '22 19:07 dibyom

@dibyom thanks for following up on this 🙏

joaosilva15 avatar Jul 08 '22 09:07 joaosilva15

@joaosilva15 are you still running into this in Triggers v0.20.2?

dibyom avatar Aug 10 '22 15:08 dibyom

Hey @dibyom, it is still happening, ~I was trying to understand why and these two lines are the only ones that would most likely cause the problem.~ https://github.com/tektoncd/triggers/blob/e324592c33405efac198c02233fcef85922ca15d/pkg/sink/sink.go#L459-L460 ~but we are using the new format (or at least I think so), so the webhook key should not be defined in our case~

Scratch that, I misread the code 😛

Edit: I'll try again!

I believe our problem is that we have a big event listener with a lot of triggers, and when an event comes in, this gets called for each one: https://github.com/tektoncd/triggers/blob/e324592c33405efac198c02233fcef85922ca15d/pkg/sink/sink.go#L477 The getter method is (I think) this one, whose get function does not use the cache: https://github.com/tektoncd/triggers/blob/e324592c33405efac198c02233fcef85922ca15d/pkg/client/injection/informers/triggers/v1alpha1/clusterinterceptor/clusterinterceptor.go#L106

This means that on every event the listener calls the API server for the ClusterInterceptor once for every interceptor the event is evaluated against.
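For illustration, the difference being described is between a live clientset Get (each call is an API request, counted against client-go's client-side rate limiter) and a lister Get served from a shared informer's local cache. A generic client-go sketch, using ConfigMaps purely for brevity rather than the Triggers ClusterInterceptor code:

package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Pattern 1: direct Get. Every call is a round trip to the API server and
	// is subject to client-side throttling when many arrive at once.
	if _, err := client.CoreV1().ConfigMaps("default").Get(context.TODO(), "example", metav1.GetOptions{}); err != nil {
		fmt.Println("direct get:", err)
	}

	// Pattern 2: lister Get. The shared informer watches the resource once and
	// keeps a local cache, so per-event lookups never hit the API server.
	factory := informers.NewSharedInformerFactory(client, 10*time.Minute)
	lister := factory.Core().V1().ConfigMaps().Lister()
	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	factory.WaitForCacheSync(stop)
	if cm, err := lister.ConfigMaps("default").Get("example"); err == nil {
		fmt.Println("cached get:", cm.Name)
	}
}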

joaosilva15 avatar Aug 11 '22 10:08 joaosilva15

Hmm, that's really odd - that is using a Lister to get the clusterinterceptor and the lister should utilize the informer cache

dibyom avatar Aug 11 '22 15:08 dibyom

This issue is still there. The EventListener is timing out. Any update on this? @dibyom

Waited for 14.59582426s due to client-side throttling, not priority and fairness, request: GET:https://10.16.0.1:443/apis/triggers.tekton.dev/v1beta1/namespaces/devops/triggers/github-task-success

I'm using the latest release of Triggers. Any temporary workaround that we can use?

quant-daddy avatar Oct 13 '22 15:10 quant-daddy

@quant-daddy which version of Triggers? And do you see this in the EL logs? Also, do you have any stats around the number of triggers/requests the EL is trying to process?

dibyom avatar Oct 13 '22 16:10 dibyom

Hi @dibyom, I'm using the latest version of Triggers (0.21). The issue arises mainly because I'm using the event listener as the cloudEventsUri, so every PipelineRun/TaskRun status change hits it. This leads to a lot of requests, of course. Processing each request is also expensive because the listener has to call the ClusterInterceptor for each trigger it processes, and because of API throttling, hitting the ClusterInterceptor too frequently is a bad idea. To give you an idea: say my interceptor uses 15 triggers (including a GitHub status trigger), and each pipeline run has 5 taskruns, each of which triggers 15 ClusterInterceptor API calls because of the status change. That is around 75 calls per pipeline run. What I also noticed is that the throttling triggers some kind of looping behavior: the listener interceptor keeps being triggered once the API throttling kicks in; maybe there is a retry mechanism somewhere?
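A rough back-of-envelope check of those figures against client-go's default client-side limit of about 5 requests per second (the counts below are the ones quoted above, not measurements):

package main

import "fmt"

func main() {
	const (
		triggers     = 15 // triggers evaluated per incoming event (figure quoted above)
		eventsPerRun = 5  // taskrun status events per PipelineRun (figure quoted above)
		defaultQPS   = 5  // client-go's default client-side rate limit
	)
	calls := triggers * eventsPerRun // ~75 ClusterInterceptor GETs per PipelineRun
	fmt.Printf("~%d queued GETs take roughly %ds to drain at QPS=%d\n",
		calls, calls/defaultQPS, defaultQPS)
	// ~75 GETs at 5 per second ≈ 15s, the same order of magnitude as the
	// "Waited for 11-21s" lines earlier in the thread.
}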

A temporary workaround I'm thinking of is having a separate service for the cloud events sink. It makes more sense to have a constantly running pod for it, and it would significantly reduce the load on the interceptor. A long-term solution is probably to not use a CRD for the interceptor and instead have it inside the eventlistener itself.

Happy to jump on a zoom call if that helps in finding the issue.

quant-daddy avatar Oct 13 '22 17:10 quant-daddy

This issue has disappeared after I started using a separate service as the cloud events sink.

quant-daddy avatar Oct 15 '22 14:10 quant-daddy

Can you join our next WG call? Or send us the spec, environment and load which causes this issue?

khrm avatar Oct 19 '22 15:10 khrm

Hi @khrm, the issue was mainly caused by an overload of cloud events created by PipelineRun/TaskRun status changes. I was using the event interceptor as the cloud event sink. This led to almost 60-70 events per pipeline run, and hence a call to the API server for each one to evaluate the ClusterInterceptor logic (probably because ClusterInterceptor is a CRD). Once I used a separate service as the sink, things started working well. Happy to join the WG call if it would help.

quant-daddy avatar Oct 19 '22 15:10 quant-daddy

Hi, I am experiencing the same issue here so I will describe our setup in hopes that it helps identify the issue. The throttling we experience looks more severe than what others have posted. I see the following in the EventListener logs:

Waited for 1h44m57.05721051s due to client-side throttling, not priority and fairness, request: GET:https://10.43.0.1:443/apis/triggers.tekton.dev/v1alpha1/clusterinterceptors/bdx-tekton-interceptors

This happens when I move from v0.16.1 to v0.18.1 (I cannot really test any higher versions on my setup right now unfortunately).

We have a single EventListener with about 60 to 100 Triggers defined in the test setup. Each trigger has the same interceptor defined, which references the bdx-tekton-interceptors ClusterInterceptor seen in the log above. We push about 10 to 20 events into the EventListener per minute. The throttling seems to get progressively worse, and we get to the point where it takes about 1 to 2 hours of retrying events to kick off triggers.

Rolling back to v0.16.1 resolves the issue.

kbristow avatar Feb 09 '23 14:02 kbristow

Hi, I was doing some load testing with my Pipeline in my cluster to see if the cluster can handle my expected load in terms of running PipelineRuns. To do so, I send 20 curls at once (to get a "step response") to a single EventListener, which is responsible for at least 23 Triggers. The first 2-4 PipelineRuns are created immediately and the remaining ones pop up one after another at roughly 2-3 minute intervals. After ~20 minutes all PipelineRuns are finally running.

Worth mentioning: while doing the load test, the cluster is already running a significant number of PipelineRuns.

When digging into this problem I saw that curl returned exit code 22, which means it got an HTTP response code of 400 or higher back from the EventListener. Looking into the EventListener logs I found the same errors mentioned above. I currently cannot get the exact error, but I can provide it if helpful. It's something like this:

Waited for XXXs due to client-side throttling, not priority and fairness, request: GET:<SOME-URL>

seternate avatar Mar 03 '23 13:03 seternate

The issue is still present in Triggers v0.23.1:

Waited for 29.020773428m due to client-side throttling, not priority and fairness, request: GET:https://10.0.0.1:443/apis/triggers.tekton.dev/v1beta1/namespaces/tekton/eventlisteners/events-listener

mike-serchenia avatar Apr 10 '23 16:04 mike-serchenia

A few more errors:

I0414 16:01:06.384281       1 request.go:682] Waited for 42m40.39465328s due to client-side throttling, not priority and fairness, request: GET:https://10.252.0.1:443/apis/triggers.tekton.dev/v1alpha1/clusterinterceptors/cel
I0414 16:01:06.644570       1 request.go:682] Waited for 41m22.388485646s due to client-side throttling, not priority and fairness, request: GET:https://10.252.0.1:443/apis/triggers.tekton.dev/v1alpha1/clusterinterceptors/cel
I0414 16:01:15.599032       1 request.go:682] Waited for 41m25.193183609s due to client-side throttling, not priority and fairness, request: GET:https://10.252.0.1:443/apis/triggers.tekton.dev/v1alpha1/clusterinterceptors/cel
I0414 16:01:15.971174       1 request.go:682] Waited for 41m47.393266869s due to client-side throttling, not priority and fairness, request: GET:https://10.252.0.1:443/apis/triggers.tekton.dev/v1alpha1/clusterinterceptors/cel
I0414 16:01:16.583722       1 request.go:682] Waited for 42m41.793147635s due to client-side throttling, not priority and fairness, request: GET:https://10.252.0.1:443/apis/triggers.tekton.dev/v1alpha1/clusterinterceptors/cel

mike-serchenia avatar Apr 14 '23 16:04 mike-serchenia

Hello @dibyom, I've been looking into this, and I think the issue is that we are using the DynamicInformers, which do not cache these responses. As a simple example, I've patched the latest Triggers release and added the following code here: https://github.com/tektoncd/triggers/blob/v0.23.1/cmd/eventlistenersink/main.go#L68

	// Set up and start the shared informers registered in the injection
	// context, so that subsequent lookups can be served from their caches.
	ictx, informers := injection.Default.SetupInformers(ctx, cfg)
	if informers != nil {
		if err := controller.StartInformers(ctx.Done(), informers...); err != nil {
			panic(fmt.Sprint("Failed to start informers - ", err))
		}
	}
	ctx = ictx

This seems to override the dynamic informers registered in ctx, since we call the withInformer factory that writes to the same key in the context. Our entries (in our case, mostly clusterinterceptors) now appear to be cached. This has been running for a while in our cluster and we have had no client-side throttling issues so far, whereas before we hit them on pretty much every event received, because we have a lot of triggers associated with this eventlistener.

Can you confirm if this is the right approach?

Thanks

miguelvalerio avatar Apr 20 '23 19:04 miguelvalerio

@miguelvalerio that's a great find! Thanks for the explanation... Would you be open to contributing that change as a PR?

Can you confirm if this is the right approach?

That sounds right. I was under the impression that something was already calling injection.Default.SetupInformers, but it seems like we do need to call it manually.

dibyom avatar Apr 21 '23 19:04 dibyom