Tekton unable to handle PipelineRuns that are too big
Expected Behavior
Tekton should be able to run large PipelineRuns, or at least fail with a proper error.
Actual Behavior
PipelineRun gets stuck in the same status in the cluster, not respecting any timeouts.
Steps to Reproduce the Problem
- Create a PipelineRun with a lot of embedded tasks and status (see the sketch after this list for one way to generate such a manifest)
- Watch the tekton controller logs
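For illustration, a rough sketch of a generator for such a PipelineRun. The task count, filler size, and names are arbitrary, chosen so that the spec itself stays under etcd's default ~1.5 MiB request limit, but once the controller copies every TaskRun status back into the PipelineRun (with `full` embedded status) the object overflows it:

```go
package main

import (
	"fmt"
	"strings"
)

// Prints a PipelineRun with many embedded tasks. Pipe into `kubectl create -f -`.
func main() {
	fmt.Print(`apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  generateName: huge-run-
spec:
  pipelineSpec:
    tasks:
`)
	// ~4 KiB of filler per task; 150 tasks is roughly 600 KiB of spec that
	// gets mirrored (and inflated) in the status under `full` embedded status.
	filler := strings.Repeat("x", 4*1024)
	for i := 0; i < 150; i++ {
		fmt.Printf("    - name: task-%d\n", i)
		fmt.Println("      taskSpec:")
		fmt.Println("        steps:")
		fmt.Println("        - name: noop")
		fmt.Println("          image: busybox")
		fmt.Println("          script: |")
		fmt.Printf("            echo %q > /dev/null\n", filler)
	}
}
```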
Versions Info
Kubernetes Version (output of `kubectl version -o yaml`)
$ kubectl version -o yaml
clientVersion:
  buildDate: "2022-10-12T10:47:25Z"
  compiler: gc
  gitCommit: 434bfd82814af038ad94d62ebe59b133fcb50506
  gitTreeState: clean
  gitVersion: v1.25.3
  goVersion: go1.19.2
  major: "1"
  minor: "25"
  platform: darwin/amd64
kustomizeVersion: v4.5.7
serverVersion:
  buildDate: "2022-11-29T18:41:42Z"
  compiler: gc
  gitCommit: 52e500d139bdef42fbc4540c357f0565c7867a81
  gitTreeState: clean
  gitVersion: v1.22.16-eks-ffeb93d
  goVersion: go1.16.15
  major: "1"
  minor: 22+
  platform: linux/amd64
Tekton Pipeline Version (output of `tkn version`)
Client version: 0.26.0
Pipeline version: v0.35.1
Triggers version: v0.20.0
Dashboard version: v0.26.0
Additional Info
- We (Nubank) manage a very large CI/CD cluster (we reached over 700k TaskRuns/week), and we've built some abstractions on top of Tekton to make it easier for our users to declare new pipelines
- We started to observe that some pipelineruns were getting stuck in our cluster, not finishing even hours after the timeout.
- Investigating the logs, we found these (I'm formatting them to make them easier to read):
{
caller: pipelinerun/reconciler.go:268
commit: 422a468
error: etcdserver: request is too large
knative.dev/controller: github.com.tektoncd.pipeline.pkg.reconciler.pipelinerun.Reconciler
knative.dev/key: pipelines/itaipu-tests-main-run-ie3c7
knative.dev/kind: tekton.dev.PipelineRun
knative.dev/traceid: d1c3c461-cd4e-4656-a913-900fede9ba6e
level: warn
logger: tekton-pipelines-controller
msg: Failed to update resource status
targetMethod: ReconcileKind
ts: 2023-01-18T13:54:54.256Z
}
{
caller: controller/controller.go:566
commit: 422a468
duration: 14.036001784
error: etcdserver: request is too large
knative.dev/controller: github.com.tektoncd.pipeline.pkg.reconciler.pipelinerun.Reconciler
knative.dev/key: pipelines/itaipu-tests-main-run-ie3c7
knative.dev/kind: tekton.dev.PipelineRun
knative.dev/traceid: d1c3c461-cd4e-4656-a913-900fede9ba6e
level: error
logger: tekton-pipelines-controller
msg: Reconcile error
stacktrace: github.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller.(*Impl).handleErr
github.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller/controller.go:566
github.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller.(*Impl).processNextWorkItem
github.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller/controller.go:543
github.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller.(*Impl).RunContext.func3
github.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller/controller.go:491
ts: 2023-01-18T13:54:54.256Z
}
- These two errors indicated that the Tekton Pipelines controller was trying to update the status, but was getting this `request is too large` error.
- Searching for other logs, we found these:
{
caller: pipelinerun/pipelinerun.go:1360
commit: 422a468
knative.dev/controller: github.com.tektoncd.pipeline.pkg.reconciler.pipelinerun.Reconciler
knative.dev/key: pipelines/itaipu-tests-main-run-ie3c7
knative.dev/kind: tekton.dev.PipelineRun
knative.dev/traceid: d1c3c461-cd4e-4656-a913-900fede9ba6e
level: info
logger: tekton-pipelines-controller
msg: Found a TaskRun itaipu-tests-main-run-ie3c7-bors-ready that was missing from the PipelineRun status
ts: 2023-01-18T13:54:40.233Z
}
{
caller: taskrun/taskrun.go:117
commit: 422a468
knative.dev/controller: github.com.tektoncd.pipeline.pkg.reconciler.taskrun.Reconciler
knative.dev/key: pipelines/itaipu-tests-main-run-ie3c7-bors-ready
knative.dev/kind: tekton.dev.TaskRun
knative.dev/traceid: 74f9a53a-8c95-49c0-9093-2e6095e0ffa3
level: info
logger: tekton-pipelines-controller
msg: taskrun done : itaipu-tests-main-run-ie3c7-bors-ready
ts: 2023-01-18T13:59:38.227Z
}
- These logs indicated that there was nothing wrong at the TaskRun level: the TaskRun `itaipu-tests-main-run-ie3c7-bors-ready` was done, but it was missing from the PipelineRun status, since the controller was failing to update that status due to the `etcdserver: request is too large` error.
- This made the PipelineRun get stuck, since it couldn't update its status, not even to time out.
- We are still using the `full` embedded status. We know that's not ideal for our scenario, and we want to move to `minimal` as soon as possible. That being said, I still believe we might hit this error in the future for some pipelines that get really huge.
My first thought was to just remove the `.Status.TaskRuns` field when we receive an `etcdserver: request is too large` error, but this error is handled here, which is generated code that we probably shouldn't touch, right?
So what we could do is remove the field when the PipelineRun's timeout is reached, but the tricky part is that we can't check whether we are receiving the error or not.
After considering this problem a bit, I realized that the problem is not the condition that changed, but the TaskRun statuses that don't fit within the manifest size limit. So we should not receive a `request is too large` error if we return the timeout condition before reconciling the TaskRun statuses. Or we could simply re-copy the existing status before changing it to the timeout reason.
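To make that ordering concrete, here is a minimal sketch. It is a toy model only, not the actual reconciler code; the condition string and size numbers are made up to show why persisting the timeout condition before copying the embedded TaskRun statuses avoids the error for the condition update:

```go
package main

import (
	"errors"
	"fmt"
)

// Toy model only, not Tekton code: a PipelineRun status is a small condition
// plus a potentially huge blob of embedded TaskRun statuses.
const etcdRequestLimit = 1_500_000 // etcd's default request limit, ~1.5 MiB

type status struct {
	condition      string
	childStatusLen int
}

// updateStatus mimics the API call that fails once the serialized status
// exceeds the etcd request limit.
func updateStatus(s status) error {
	if len(s.condition)+s.childStatusLen > etcdRequestLimit {
		return errors.New("etcdserver: request is too large")
	}
	return nil
}

func reconcileTimedOutRun(childStatusLen int) error {
	// Step 1: persist only the timeout condition. This payload is tiny, so it
	// cannot hit the limit, and the run becomes observably done.
	if err := updateStatus(status{condition: "Failed: PipelineRunTimeout"}); err != nil {
		return err
	}
	// Step 2: only then copy the full embedded TaskRun statuses, which is the
	// part that can exceed the limit under `full` embedded status.
	return updateStatus(status{condition: "Failed: PipelineRunTimeout", childStatusLen: childStatusLen})
}

func main() {
	// The second update still fails for an oversized status, but the timeout
	// condition was already persisted in step 1.
	fmt.Println(reconcileTimedOutRun(2_000_000))
}
```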
Hello @RafaeLeal - which version of Tekton do you use?
Recently we introduced a configuration option for Tekton to embed in the status only a reference to the owned resources as opposed to a copy of the whole status. In the latest release, this behaviour is now the default.
Depending on the number of `TaskRuns`, this can make a significant difference in the overall size.
See "embedded status" in https://tekton.dev/docs/pipelines/install/#customizing-the-pipelines-controller-behavior
Hey @afrittoli, as I stated in the additional info, we're still using the `full` embedded status. We are going to change to `minimal` as soon as possible, but I still think the Tekton controller should be able to handle this error. Since we generate PipelineRuns, it's quite possible that we would generate a PipelineRun that is just too big to run.
I'm OK with Tekton just giving an error in this scenario, but having it get stuck in the cluster is quite a problem, because we use the number of TaskRuns in the cluster to know whether we have "space" for more, and this "locks" part of that quota.
Hi @RafaeLeal - thanks, I missed that bit of information.
You're correct that the code where the error is managed is generated, so we cannot change it directly in Tekton.
If I read the flow correctly, the error is treated as a non-permanent one, which means the key is re-queued and it will eventually be picked up by the `TaskRun` controller again (with rate limiting, but still...).
The problem is that by the time the key is reconciled again, the controller has no knowledge of the previous error and thus no way to prevent it.
The only strategy I could think of would be, in case the timeout has expired by more than a certain threshold, to skip any status update except for the `PipelineRun`'s own condition. Even that would not guarantee an update in all cases, but it should help in your case to let the `PipelineRun` eventually time out. It would still require checking the logs to discover the actual cause of the failure, though.
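For concreteness, a rough sketch of what such a threshold check could look like; the helper name and the grace value are hypothetical, not existing Tekton code:

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical policy: once a PipelineRun has been past its timeout for
// longer than a grace period, skip copying the embedded child statuses and
// persist only the run's own condition, so the status update stays small
// enough for etcd.
const childStatusGrace = 5 * time.Minute // arbitrary threshold, for illustration only

func skipChildStatusUpdate(startTime time.Time, timeout time.Duration, now time.Time) bool {
	return now.After(startTime.Add(timeout).Add(childStatusGrace))
}

func main() {
	start := time.Now().Add(-2 * time.Hour)
	fmt.Println(skipChildStatusUpdate(start, time.Hour, time.Now())) // true: update only the condition
}
```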
> You're correct that the code where the error is managed is generated, so we cannot change it directly in Tekton. If I read the flow correctly, the error is treated as a non-permanent one, which means the key is re-queued and it will eventually be picked up by the `TaskRun` controller again (with rate limiting, but still...). The problem is that by the time the key is reconciled again, the controller has no knowledge of the previous error and thus no way to prevent it.
Thanks a lot for your input @afrittoli! That's exactly the problem as I understood it.
> The only strategy I could think of would be, in case the timeout has expired by more than a certain threshold, to skip any status update except for the `PipelineRun`'s own condition. Even that would not guarantee an update in all cases, but it should help in your case to let the `PipelineRun` eventually time out. It would still require checking the logs to discover the actual cause of the failure, though.
This strategy you proposed was something I was thinking of too. I've implemented it here.
While implementing it, I was considering whether we could always do a two-step timing out; this way we could avoid arbitrary thresholds. The first reconciliation would check `pr.HasTimedOut()`, mark the status, and return `controller.NewRequeueImmediately()`. This would trigger an UpdateStatus with only the condition change; then, in the second reconciliation, we could try to update the rest of the status (the `childReferences`, for example).
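Roughly, the flow I have in mind looks like this. It is a toy sketch shaped like the generated reconciler; only `controller.NewRequeueImmediately()` is a real knative.dev/pkg/controller call, everything else is a stand-in:

```go
package main

import (
	"context"
	"fmt"

	"knative.dev/pkg/controller"
)

// Toy stand-in for a PipelineRun, with just the bits the sketch needs.
type run struct {
	timedOut bool // pr.HasTimedOut() in the real reconciler
	done     bool // pr.IsDone() in the real reconciler
}

// Hypothetical two-step timeout in a ReconcileKind-shaped function: the first
// pass flips only the condition (a tiny status diff, persisted by the
// generated wrapper) and asks for an immediate requeue; the second pass then
// attempts the heavy child-status update.
func reconcileKind(_ context.Context, pr *run) error {
	if pr.timedOut && !pr.done {
		pr.done = true // mark only the timeout condition
		return controller.NewRequeueImmediately()
	}
	// Second pass: the run is already done; now copy childReferences / the
	// embedded TaskRun statuses, which is the large part of the status.
	return nil
}

func main() {
	pr := &run{timedOut: true}
	fmt.Println(reconcileKind(context.Background(), pr)) // requeue from the first pass
	fmt.Println(reconcileKind(context.Background(), pr)) // nil from the second pass
}
```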
This could work, I think. But the problem I had is that we still depend a lot on the order of execution to make everything work properly. In the second reconciliation, we already have the timed-out condition, which means `pr.IsDone()` is true, and that changes the whole reconciliation process.
https://github.com/tektoncd/pipeline/blob/main/pkg/reconciler/pipelinerun/pipelinerun.go#L205-L225
Looking at this, does it make sense to "fail fast" the reconciliation process?
For example, even inside this `pr.IsDone()` branch, it runs several cleanup processes. If one fails, should we really prevent the next one from executing? 🤔