request.go:668] Waited for *s due to client-side throttling, not priority and fairness, request: GET:https://172.20.0.1:443/apis/triggers.tekton.dev/v1alpha1/clusterinterceptors/github
Expected Behavior
The EventListener works as expected.
Actual Behavior
Drastic performance degradation.
1 request.go:668] Waited for 11.393899312s due to client-side throttling, not priority and fairness, request: GET:https://172.20.0.1:443/apis/triggers.tekton.dev/v1alpha1/clusterinterceptors/github
Steps to Reproduce the Problem
- Install Operator > 0.51.1
Additional Info
- Kubernetes version (output of kubectl version):
Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.4", GitCommit:"b695d79d4f967c403a96986f1750a35eb75e75f1", GitTreeState:"clean", BuildDate:"2021-11-17T15:48:33Z", GoVersion:"go1.16.10", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"21+", GitVersion:"v1.21.2-eks-06eac09", GitCommit:"5f6d83fe4cb7febb5f4f4e39b3b2b64ebbbe3e97", GitTreeState:"clean", BuildDate:"2021-09-13T14:20:15Z", GoVersion:"go1.16.5", Compiler:"gc", Platform:"linux/amd64"}
- Tekton Pipeline version:
v0.29.0
As far as I can see, after tekton-triggers moved to knative (v0.17.x+), performance degraded drastically. All webhooks from GitHub time out.
Pipelines still run, but with a start delay of ~2 minutes. This is the time the EventListener spends parsing the webhook and trying to match the CEL condition. In the logs it looks very slow
el-github-listener-interceptor-6b86d6f674-dp5hh event-listener {"level":"info","ts":"2021-12-09T09:57:20.094Z","logger":"eventlistener","caller":"sink/sink.go:301","msg":"... type push is not allowed","eventlistener":"github-listener-interceptor"...
el-github-listener-interceptor-6b86d6f674-dp5hh event-listener {"level":"info","ts":"2021-12-09T09:57:20.293Z","logger":"eventlistener","caller":"sink/sink.go:301","msg":"... type push is not allowed","eventlistener":"github-listener-interceptor"...
el-github-listener-interceptor-6b86d6f674-dp5hh event-listener {"level":"info","ts":"2021-12-09T09:57:20.694Z","logger":"eventlistener","caller":"sink/sink.go:301","msg":"... type push is not allowed","eventlistener":"github-listener-interceptor"...
el-github-listener-interceptor-6b86d6f674-dp5hh event-listener {"level":"info","ts":"2021-12-09T09:57:21.094Z","logger":"eventlistener","caller":"sink/sink.go:301","msg":"... type push is not allowed","eventlistener":"github-listener-interceptor"...
el-github-listener-interceptor-6b86d6f674-dp5hh event-listener {"level":"info","ts":"2021-12-09T09:57:21.893Z","logger":"eventlistener","caller":"sink/sink.go:301","msg":"... type push is not allowed","eventlistener":"github-listener-interceptor"...
el-github-listener-interceptor-6b86d6f674-dp5hh event-listener {"level":"info","ts":"2021-12-09T09:57:22.294Z","logger":"eventlistener","caller":"sink/sink.go:301","msg":"... type push is not allowed","eventlistener":"github-listener-interceptor"...
el-github-listener-interceptor-6b86d6f674-dp5hh event-listener {"level":"info","ts":"2021-12-09T09:57:22.894Z","logger":"eventlistener","caller":"sink/sink.go:301","msg":"... type push is not allowed","eventlistener":"github-listener-interceptor"...
As you can see, it parses 2-3 conditions per second, whereas prior versions parsed hundreds in the same time. Today I installed different versions of tekton-operator, from 0.50.0 through 0.53.0. All versions with Triggers 0.17.x are very slow.
@savitaashture @khrm @vdemeester
I think this should be a tektoncd/triggers issue though, I can move it there.
/cc @dibyom
Just to be clear - this was not an issue with Triggers v0.16?
Just to be clear - this was not an issue with Triggers v0.16?
Correct
Pipelines still run, but with a start delay of ~2 minutes. This is the time the EventListener spends parsing the webhook and trying to match the CEL condition. In the logs it looks very slow
Hmm I wonder if we are seeing similar behavior in our own dogfooding cluster where we are on Triggers 0.17.1
Sorry for the delay. I checked our logs from the dogfooding cluster and I see many interceptor log messages roughly a millisecond apart. Based on that, I do not think this is an issue with Triggers in general. Could you confirm that your cluster was not overloaded around this time?
Hello. Since I opened this issue, my team has done two fresh installs of the latest version of the Tekton operator into different fresh clusters in different AWS accounts. In both cases the behavior was the same, so we downgraded to 0.50.2 and it just works. Both cases use the latest AWS EKS v1.21.x, with node pools provisioned exclusively for Tekton: instance_types = ["c5d.xlarge", "c5d.2xlarge", "m5d.xlarge", "m5d.2xlarge"], asg_max_size = 10. In the first case we used a Traefik ingress and Argo CD 2.1.7 for the Tekton installation; in the second, an Istio ingress (sidecars disabled) and Argo CD 2.2.1.
When I tested different versions, there was about a 3-minute gap between installations:
- kubectl apply 0.50.2
- wait until all pods are up and running
- send a GitHub webhook
- works ...
- kubectl apply > 0.50.2
- wait until all pods are up and running
- send a GitHub webhook
- problem ...
- kubectl apply 0.50.2
- pods up and running
- GitHub webhook
- works ...
I can provide additional information; just let me know what you need. Thank you.
This is also happening to me. I downgraded from 0.18 to 0.16 and the problem seemed to disappear, although in my case not all GitHub requests were failing, only some. Triggers: 0.18.0, Pipelines: 0.32.1.
One warning I get from the listener pods is "Sink timeout configuration is invalid, default to -1 (no timeout)"; not sure if this might be related to the issue.
I was thinking: if the EventListener calls the API server for each trigger on each request to fetch the interceptor, could that cause this kind of issue? Would this be a problem?
I was thinking: if the EventListener calls the API server for each trigger on each request to fetch the interceptor, could that cause this kind of issue? Would this be a problem?
That's a good guess but even in v0.16 we were making the same requests.
I wonder if this has something to do with a newer knative/pkg version
Maybe we are running into https://github.com/knative/pkg/issues/2383 and we need to port https://github.com/tektoncd/pipeline/pull/4541 to Triggers?
@joaosilva15 @qtheya could you try adding a config-leader-election.yaml file like the one below and see if that fixes your issue?
https://github.com/tektoncd/pipeline/blob/8f568ca4af50d6c18be1d19926689b9477463271/config/config-leader-election.yaml
@dibyom FYI: because I'm using the Tekton operator, I set the parameters on the TektonInstallerSet and restarted (deleted) the pods to apply the values to the ConfigMap and containers.
apiVersion: v1
data:
  leaseDuration: 60s
  renewDeadline: 40s
  retryPeriod: 10s
kind: ConfigMap
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      ....
  creationTimestamp: "2022-02-09T04:50:18Z"
  labels:
    app.kubernetes.io/instance: default
    app.kubernetes.io/part-of: tekton-pipelines
  name: config-leader-election
  namespace: tekton-pipelines
  ownerReferences:
  - apiVersion: operator.tekton.dev/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: TektonInstallerSet
    name: pipeline-874b5
The suggested solution doesn't help.
event-listener {"level":"info","ts":"2022-02-09T05:35:36.627Z","logger":"eventlistener","caller":"sink/sink.go:329","msg":"interceptor stopped trigger processing: rpc error: code = FailedPrecondition desc = event type status is not allowed","eventlistener":"github-listener-interceptor","namespace":"tekton-pipelines","eventlistenerUID":"...","/triggers-eventid":"...","/trigger":"****-push-dev"}
event-listener I0209 05:35:36.822562 1 request.go:668] Waited for 21.591784626s due to client-side throttling, not priority and fairness, request: GET:https://172.20.0.1:443/apis/triggers.tekton.dev/v1alpha1/clusterinterceptors/github
event-listener {"level":"info","ts":"2022-02-09T05:35:36.827Z","logger":"eventlistener","caller":"sink/sink.go:329","msg":"interceptor stopped trigger processing: rpc error: code = FailedPrecondition desc = event type status is not allowed","eventlistener":"github-listener-interceptor","namespace":"tekton-pipelines","eventlistenerUID":"...","/triggers-eventid":"...","/trigger":"****-push-dev"}
event-listener {"level":"info","ts":"2022-02-09T05:35:37.027Z","logger":"eventlistener","caller":"sink/sink.go:329","msg":"interceptor stopped trigger processing: rpc error: code = FailedPrecondition desc = event type status is not allowed","eventlistener":"github-listener-interceptor","namespace":"tekton-pipelines","eventlistenerUID":"...","/triggers-eventid":"...","/trigger":"****-push-dev"}
After changing the config map to that, I still get the same issue. Can you explain why this would help in this situation? Is that config map used by both the pipelines controller and the triggers controller?
Can you explain why this would help in this situation? Is that config map used by both the pipelines controller and the triggers controller?
Sure - the config is used by the knative/pkg package (which we use to build our controllers) to configure leader election. It had a default resync period of 2 seconds, which was causing a lot of calls to the API server and leading to similar client-side throttling issues.
Looks like this might be a different issue though. @khrm has https://github.com/tektoncd/triggers/pull/1308/files to switch to a non-default HTTP client, which might help here.
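For context, the "Waited for ... due to client-side throttling" lines come from client-go's per-client token-bucket rate limiter, which defaults to QPS 5 and burst 10. A rough sketch, assuming only the standard client-go rest.Config fields and not anything Triggers-specific, of where those limits live and how they could be raised as a stopgap (the real fix discussed in this thread is to stop hitting the API server on every request):

// Hypothetical sketch, not Triggers' actual wiring: client-go throttles each
// client with a token bucket that defaults to QPS=5, Burst=10, which produces
// the "client-side throttling" wait messages seen in the logs above.
package main

import (
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func newClient() (*kubernetes.Clientset, error) {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		return nil, err
	}
	cfg.QPS = 50    // default is 5 requests/second when left at 0
	cfg.Burst = 100 // default is 10
	return kubernetes.NewForConfig(cfg)
}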
Hey @dibyom, I forgot about this and left Triggers at version 0.16 for a while. I tried upgrading to 0.20.1 and it is clearly better than 0.18.0, but we still get timeouts due to throttling while getting clusterinterceptors:
Waited for 11.592004675s due to client-side throttling, not priority and fairness, request: GET:https://172.20.0.1:443/apis/triggers.tekton.dev/v1alpha1/clusterinterceptors/cel
I believe this is the same issue as https://github.com/tektoncd/triggers/issues/1374
@joaosilva15 yes it seems like there are some spots where we are calling the API server directly instead of going through the lister cache.
@dibyom thanks for following up on this 🙏
@joaosilva15 are you still running into this in Triggers v0.20.2?
Hey @dibyom, it is still happening, ~I was trying to understand why and these two lines are the only ones that would most likely cause the problem.~ https://github.com/tektoncd/triggers/blob/e324592c33405efac198c02233fcef85922ca15d/pkg/sink/sink.go#L459-L460 ~but we are using the new format (or at least I think so), so the webhook key should not be defined in our case~
Scratch that, I misread the code 😛
Edit: I'll try again!
I believe our problem is that we have a big EventListener with a lot of triggers, and when an event comes in, this gets called for each one: https://github.com/tektoncd/triggers/blob/e324592c33405efac198c02233fcef85922ca15d/pkg/sink/sink.go#L477. The getter method is (I think) this one, whose Get function is not using the cache: https://github.com/tektoncd/triggers/blob/e324592c33405efac198c02233fcef85922ca15d/pkg/client/injection/informers/triggers/v1alpha1/clusterinterceptor/clusterinterceptor.go#L106
This means that for every event, the listener calls the API server for the ClusterInterceptor once for each interceptor the event is evaluated against.
Hmm, that's really odd - that is using a Lister to get the clusterinterceptor and the lister should utilize the informer cache
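To make the distinction concrete, here is a minimal, hypothetical client-go sketch (using ConfigMaps as a stand-in resource, not the generated Triggers clients) of the difference between a direct GET, which hits the API server and the client-side rate limiter on every call, and a lister-backed GET, which is served from the shared informer cache:

package main

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// Direct GET: one HTTP request to the API server per call, each of which
	// consumes a token from the client-side rate limiter.
	if _, err := client.CoreV1().ConfigMaps("tekton-pipelines").
		Get(context.Background(), "config-leader-election", metav1.GetOptions{}); err != nil {
		panic(err)
	}

	// Lister-backed GET: a shared informer watches the resource once and keeps
	// a local cache; Get() reads from memory instead of calling the API server.
	factory := informers.NewSharedInformerFactory(client, 10*time.Minute)
	lister := factory.Core().V1().ConfigMaps().Lister()

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	factory.WaitForCacheSync(stop)

	if _, err := lister.ConfigMaps("tekton-pipelines").Get("config-leader-election"); err != nil {
		panic(err)
	}
}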
This issue is still there. The EventListener is timing out. Any update on this? @dibyom
Waited for 14.59582426s due to client-side throttling, not priority and fairness, request: GET:https://10.16.0.1:443/apis/triggers.tekton.dev/v1beta1/namespaces/devops/triggers/github-task-success
I'm using the latest release of Triggers. Any temporary workaround that we can use?
@quant-daddy which version of Triggers? And do you see this in the EL logs? Also, do you have any stats around the number of triggers/requests the EL is trying to process?
Hi @dibyom, I'm using the latest version of Triggers (0.21). The issue mainly arises because I'm using the event listener as the cloudEventsUri, so every PipelineRun/TaskRun status change hits it. This of course leads to a lot of requests. Processing each request is also expensive because the listener has to call the ClusterInterceptor for each trigger it processes, and because of API throttling, hitting the ClusterInterceptor too frequently is a bad idea. To give you an idea: say my EventListener has 15 triggers (including the GitHub status trigger) and each pipeline run has 5 TaskRuns, each of whose status changes triggers 15 ClusterInterceptor API calls; that is around 75 calls per pipeline run. What I also noticed is that the throttling triggers some kind of looping behavior: the listener interceptor keeps being triggered once the API throttling kicks in. Maybe there is a retry mechanism somewhere?
A temporary workaround I'm thinking of is having a separate service for the cloud events sink. It makes more sense to have a constantly running pod for it, and it would significantly reduce the load on the interceptor. A long-term solution is probably not using a CRD for the interceptor and instead having it inside the EventListener itself.
Happy to jump on a zoom call if that helps in finding the issue.
The issue disappeared after I started using a separate service as the cloud events sink.
Can you join our next WG call? Or send us the spec, environment and load which causes this issue?
Hi @khrm, the issue was mainly caused by the flood of cloud events created by PipelineRun/TaskRun status changes. I was using the event listener as the cloud events sink. This led to almost 60-70 events per pipeline run, and hence a call to the API server to evaluate the ClusterInterceptor logic (probably because ClusterInterceptor is a CRD). Once I used a separate service as the sink, things started working well. Happy to join the WG call if it would help.
Hi, I am experiencing the same issue here, so I will describe our setup in the hope that it helps identify the problem. The throttling we experience looks more severe than what others have posted. I see the following in the EventListener logs:
Waited for 1h44m57.05721051s due to client-side throttling, not priority and fairness, request: GET:https://10.43.0.1:443/apis/triggers.tekton.dev/v1alpha1/clusterinterceptors/bdx-tekton-interceptors
This happens when I move from v0.16.1 to v0.18.1 (I cannot really test any higher versions on my setup right now unfortunately).
We have a single EventListener with about 60 to 100 Triggers defined in the test setup. Each trigger has the same interceptor defined, which references the bdx-tekton-interceptors ClusterInterceptor seen in the log above. We push about 10 to 20 events into the EventListener per minute. The throttling progressively gets worse, to the point where it takes about 1 to 2 hours of retrying events to kick off triggers.
Rolling back to v0.16.1 resolves the issue.
Hi, I was doing some load testing with my Pipeline in my cluster to see if it can handle my expected load of PipelineRuns. To get a "step response", I fire 20 curls at once at a single EventListener, which is responsible for at least 23 Triggers. The first 2-4 PipelineRuns are created directly, and the remaining ones pop up one after another at roughly 2-3 minute intervals. After ~20 minutes all PipelineRuns are finally running.
Worth mentioning: while doing the load test, the cluster is already running a significant number of PipelineRuns.
When digging into this problem I saw that curl returns exit code 22, which means it got an HTTP status of 400 or above back from the EventListener. Looking at the EventListener logs, I found the same errors mentioned above. I cannot get the exact error right now, but can provide it if helpful. It's something like this:
Waited for XXXs due to client-side throttling, not priority and fairness, request: GET:<SOME-URL>
Still present in Triggers v0.23.1:
Waited for 29.020773428m due to client-side throttling, not priority and fairness, request: GET:https://10.0.0.1:443/apis/triggers.tekton.dev/v1beta1/namespaces/tekton/eventlisteners/events-listener
A few more errors:
I0414 16:01:06.384281 1 request.go:682] Waited for 42m40.39465328s due to client-side throttling, not priority and fairness, request: GET:https://10.252.0.1:443/apis/triggers.tekton.dev/v1alpha1/clusterinterceptors/cel
I0414 16:01:06.644570 1 request.go:682] Waited for 41m22.388485646s due to client-side throttling, not priority and fairness, request: GET:https://10.252.0.1:443/apis/triggers.tekton.dev/v1alpha1/clusterinterceptors/cel
I0414 16:01:15.599032 1 request.go:682] Waited for 41m25.193183609s due to client-side throttling, not priority and fairness, request: GET:https://10.252.0.1:443/apis/triggers.tekton.dev/v1alpha1/clusterinterceptors/cel
I0414 16:01:15.971174 1 request.go:682] Waited for 41m47.393266869s due to client-side throttling, not priority and fairness, request: GET:https://10.252.0.1:443/apis/triggers.tekton.dev/v1alpha1/clusterinterceptors/cel
I0414 16:01:16.583722 1 request.go:682] Waited for 42m41.793147635s due to client-side throttling, not priority and fairness, request: GET:https://10.252.0.1:443/apis/triggers.tekton.dev/v1alpha1/clusterinterceptors/cel
Hello @dibyom
I've been taking a look into this, and I think the issue is that we are using the DynamicInformers, which do not cache these responses.
As a simple example, I've patched the latest triggers release, and added the following code here https://github.com/tektoncd/triggers/blob/v0.23.1/cmd/eventlistenersink/main.go#L68
// Set up and start the standard (cache-backed) informers so that subsequent
// lister Gets are served from the informer cache instead of the API server.
ictx, informers := injection.Default.SetupInformers(ctx, cfg)
if informers != nil {
	if err := controller.StartInformers(ctx.Done(), informers...); err != nil {
		panic(fmt.Sprint("Failed to start informers - ", err))
	}
}
ctx = ictx
This seems to overwrite the dynamic informers registered in ctx, since we call the withInformer factory that writes to the same key in the context. Our entries (in our case, mostly clusterinterceptors) now seem to be cached. This has been running for a while in our cluster and we have had no client-side throttling issues so far, whereas before we hit them on pretty much every event received, because we have a lot of triggers associated with this EventListener.
Can you confirm if this is the right approach?
Thanks
@miguelvalerio that's a great find! Thanks for the explanation... Would you be open to contributing that change in a PR?
Can you confirm if this is the right approach?
That sounds right. I was under the impression that something was already calling injection.Default.SetupInformers, but it seems we do need to call it manually.