Fix for PipelineRuns getting stuck in the running state in the cluster
Changes
Issue: https://github.com/tektoncd/pipeline/issues/6076 This PR tries to fix scenarios where in the same reconciliation cycle the PipelineRun status gets too big to be updated and the status is changed to timeout. This means that the UpdateStatus fails, so the PipelineRun can't get out from the running status, even after the timeout. The PR introduces an arbitrary threshold (2*timeout), and if the reconciliation notice that threshold is reached, it skips any other update, and updates only the status, making the request as small as possible, avoiding etcd request size errors.
Submitter Checklist
As the author of this PR, please check off the items in this checklist:
- [x] Has Docs included if any changes are user facing
- [x] Has Tests included if any functionality added or changed
- [x] Follows the commit message standard
- [x] Meets the Tekton contributor standards (including functionality, content, code)
- [x] Has a kind label. You can add one by adding a comment on this PR that contains
/kind <type>. Valid types are bug, cleanup, design, documentation, feature, flake, misc, question, tep - [x] Release notes block below has been updated with any user facing changes (API changes, bug fixes, changes requiring upgrade notices or deprecation warnings)
- [x] Release notes contains the string "action required" if the change requires additional action from users switching to the new release
Release Notes
Fix a bug that made big PipelineRuns get stuck in the running state in the cluster
The committers listed above are authorized under a signed CLA.
- :white_check_mark: login: RafaeLeal / name: Rafael Leal (bcfc8b2d391020ccb1e4a017235d12aa7bfc81b7)
The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report
| File | Old Coverage | New Coverage | Delta |
|---|---|---|---|
| pkg/apis/pipeline/v1beta1/pipelinerun_types.go | 88.3% | 83.5% | -4.9 |
| pkg/reconciler/pipelinerun/pipelinerun.go | 86.6% | 86.5% | -0.0 |
| pkg/reconciler/pipelinerun/timeout.go | 84.1% | 84.8% | 0.7 |
The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage-df to re-run this coverage report
| File | Old Coverage | New Coverage | Delta |
|---|---|---|---|
| pkg/apis/pipeline/v1beta1/pipelinerun_types.go | 88.3% | 83.5% | -4.9 |
| pkg/reconciler/pipelinerun/pipelinerun.go | 86.6% | 86.5% | -0.0 |
| pkg/reconciler/pipelinerun/timeout.go | 84.1% | 84.8% | 0.7 |
The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report
| File | Old Coverage | New Coverage | Delta |
|---|---|---|---|
| pkg/apis/pipeline/v1beta1/pipelinerun_types.go | 88.3% | 83.5% | -4.9 |
| pkg/reconciler/pipelinerun/pipelinerun.go | 86.6% | 86.5% | -0.0 |
| pkg/reconciler/pipelinerun/timeout.go | 84.1% | 84.8% | 0.7 |
The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage-df to re-run this coverage report
| File | Old Coverage | New Coverage | Delta |
|---|---|---|---|
| pkg/apis/pipeline/v1beta1/pipelinerun_types.go | 88.3% | 83.5% | -4.9 |
| pkg/reconciler/pipelinerun/pipelinerun.go | 86.6% | 86.5% | -0.0 |
| pkg/reconciler/pipelinerun/timeout.go | 84.1% | 84.8% | 0.7 |
The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report
| File | Old Coverage | New Coverage | Delta |
|---|---|---|---|
| pkg/apis/pipeline/v1beta1/pipelinerun_types.go | 88.3% | 83.5% | -4.9 |
| pkg/reconciler/pipelinerun/pipelinerun.go | 86.6% | 86.5% | -0.0 |
| pkg/reconciler/pipelinerun/timeout.go | 84.1% | 84.8% | 0.7 |
The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage-df to re-run this coverage report
| File | Old Coverage | New Coverage | Delta |
|---|---|---|---|
| pkg/apis/pipeline/v1beta1/pipelinerun_types.go | 88.3% | 83.5% | -4.9 |
| pkg/reconciler/pipelinerun/pipelinerun.go | 86.6% | 86.5% | -0.0 |
| pkg/reconciler/pipelinerun/timeout.go | 84.1% | 84.8% | 0.7 |
The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report
| File | Old Coverage | New Coverage | Delta |
|---|---|---|---|
| pkg/apis/pipeline/v1beta1/pipelinerun_types.go | 88.3% | 83.5% | -4.9 |
| pkg/reconciler/pipelinerun/pipelinerun.go | 86.6% | 86.5% | -0.0 |
| pkg/reconciler/pipelinerun/timeout.go | 84.1% | 84.8% | 0.7 |
The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage-df to re-run this coverage report
| File | Old Coverage | New Coverage | Delta |
|---|---|---|---|
| pkg/apis/pipeline/v1beta1/pipelinerun_types.go | 88.3% | 83.5% | -4.9 |
| pkg/reconciler/pipelinerun/pipelinerun.go | 86.6% | 86.5% | -0.0 |
| pkg/reconciler/pipelinerun/timeout.go | 84.1% | 84.8% | 0.7 |
/kind bug
The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report
| File | Old Coverage | New Coverage | Delta |
|---|---|---|---|
| pkg/apis/pipeline/v1beta1/pipelinerun_types.go | 88.3% | 83.5% | -4.9 |
| pkg/reconciler/pipelinerun/pipelinerun.go | 87.2% | 87.2% | -0.1 |
| pkg/reconciler/pipelinerun/timeout.go | 84.1% | 84.8% | 0.7 |
The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage-df to re-run this coverage report
| File | Old Coverage | New Coverage | Delta |
|---|---|---|---|
| pkg/apis/pipeline/v1beta1/pipelinerun_types.go | 88.3% | 83.5% | -4.9 |
| pkg/reconciler/pipelinerun/pipelinerun.go | 87.2% | 87.2% | -0.1 |
| pkg/reconciler/pipelinerun/timeout.go | 84.1% | 84.8% | 0.7 |
Thanks, @RafaeLeal for this PR!
Could you provide a bit more context about the issue in the commit message / PR description, and fill in the PR checklist from the template? Also, tide does not automatically squash commits before merging, so I would ask you to please squash them at least once the PR has been approved.
Thank you!
The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report
| File | Old Coverage | New Coverage | Delta |
|---|---|---|---|
| pkg/apis/pipeline/v1beta1/pipelinerun_types.go | 88.3% | 83.5% | -4.9 |
| pkg/reconciler/pipelinerun/pipelinerun.go | 87.2% | 87.0% | -0.2 |
| pkg/reconciler/pipelinerun/timeout.go | 84.1% | 84.8% | 0.7 |
The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage-df to re-run this coverage report
| File | Old Coverage | New Coverage | Delta |
|---|---|---|---|
| pkg/apis/pipeline/v1beta1/pipelinerun_types.go | 88.3% | 83.5% | -4.9 |
| pkg/reconciler/pipelinerun/pipelinerun.go | 87.2% | 87.0% | -0.2 |
| pkg/reconciler/pipelinerun/timeout.go | 84.1% | 84.8% | 0.7 |
The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report
| File | Old Coverage | New Coverage | Delta |
|---|---|---|---|
| pkg/apis/pipeline/v1beta1/pipelinerun_types.go | 88.3% | 89.0% | 0.6 |
| pkg/reconciler/pipelinerun/pipelinerun.go | 87.2% | 87.0% | -0.2 |
| pkg/reconciler/pipelinerun/timeout.go | 84.1% | 84.8% | 0.7 |
Could you provide a bit more context about the issue in the commit message / PR description, and fill in the PR checklist from the template? Also, tide does not automatically squash commits before merging, so I would ask you to please squash them at least once the PR has been approved.
Sure, I was not done yet actually. I added another test on HasTimedOutForALongTime func. I had some ideas on how to avoid arbitrary thresholds, that I shared in the issue, which I quote here.
While implementing it, I was considering maybe we could always do a two-step timing out, this way we could avoid arbitrary thresholds. The first reconciliation would check
pr.HasTimedOut()and mark the status and returncontroller.NewRequeueImmediately(). This would trigger a UpdateStatus with only the condition change, then in the second reconciliation, we could try to update the rest of the status (thechildReferences, for example).
This could work, I think. But the problem I had is that we still depend a lot on the order of the execution to make everything work properly. In the second reconciliation, we already have the timed-out condition, which means the
pr.IsDone()is true and this changes the whole reconciliation process. https://github.com/tektoncd/pipeline/blob/main/pkg/reconciler/pipelinerun/pipelinerun.go#L205-L225
Looking at this, does it make sense to "fail fast" the reconciliation process? For example, even inside in this
pr.IsDone()branch, it runs several cleanup processes. If one fails, should we really prevent the next one from executing? 🤔
The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage-df to re-run this coverage report
| File | Old Coverage | New Coverage | Delta |
|---|---|---|---|
| pkg/apis/pipeline/v1beta1/pipelinerun_types.go | 88.3% | 89.0% | 0.6 |
| pkg/reconciler/pipelinerun/pipelinerun.go | 87.2% | 87.0% | -0.2 |
| pkg/reconciler/pipelinerun/timeout.go | 84.1% | 84.8% | 0.7 |
The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage-df to re-run this coverage report
| File | Old Coverage | New Coverage | Delta |
|---|---|---|---|
| pkg/apis/pipeline/v1beta1/pipelinerun_types.go | 88.3% | 89.0% | 0.6 |
| pkg/reconciler/pipelinerun/pipelinerun.go | 87.2% | 87.0% | -0.2 |
| pkg/reconciler/pipelinerun/timeout.go | 84.1% | 84.8% | 0.7 |
The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report
| File | Old Coverage | New Coverage | Delta |
|---|---|---|---|
| pkg/apis/pipeline/v1beta1/pipelinerun_types.go | 88.3% | 89.0% | 0.6 |
| pkg/reconciler/pipelinerun/pipelinerun.go | 87.2% | 87.0% | -0.2 |
| pkg/reconciler/pipelinerun/timeout.go | 84.1% | 84.8% | 0.7 |
The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage-df to re-run this coverage report
| File | Old Coverage | New Coverage | Delta |
|---|---|---|---|
| pkg/apis/pipeline/v1beta1/pipelinerun_types.go | 88.3% | 89.0% | 0.6 |
| pkg/reconciler/pipelinerun/pipelinerun.go | 87.2% | 87.0% | -0.2 |
| pkg/reconciler/pipelinerun/timeout.go | 84.1% | 84.8% | 0.7 |
/retest
The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report
| File | Old Coverage | New Coverage | Delta |
|---|---|---|---|
| pkg/apis/pipeline/v1beta1/pipelinerun_types.go | 88.3% | 89.0% | 0.6 |
| pkg/reconciler/pipelinerun/pipelinerun.go | 87.2% | 87.0% | -0.2 |
| pkg/reconciler/pipelinerun/timeout.go | 84.1% | 84.8% | 0.7 |
The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage-df to re-run this coverage report
| File | Old Coverage | New Coverage | Delta |
|---|---|---|---|
| pkg/apis/pipeline/v1beta1/pipelinerun_types.go | 88.3% | 89.0% | 0.6 |
| pkg/reconciler/pipelinerun/pipelinerun.go | 87.2% | 87.0% | -0.2 |
| pkg/reconciler/pipelinerun/timeout.go | 84.1% | 84.8% | 0.7 |
The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage-df to re-run this coverage report
| File | Old Coverage | New Coverage | Delta |
|---|---|---|---|
| pkg/apis/pipeline/v1beta1/pipelinerun_types.go | 87.9% | 88.6% | 0.7 |
| pkg/reconciler/pipelinerun/pipelinerun.go | 87.1% | 86.9% | -0.2 |
| pkg/reconciler/pipelinerun/timeout.go | 84.1% | 84.8% | 0.7 |
The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report
| File | Old Coverage | New Coverage | Delta |
|---|---|---|---|
| pkg/apis/pipeline/v1beta1/pipelinerun_types.go | 87.9% | 88.6% | 0.7 |
| pkg/reconciler/pipelinerun/pipelinerun.go | 87.1% | 86.9% | -0.2 |
| pkg/reconciler/pipelinerun/timeout.go | 84.1% | 84.8% | 0.7 |
The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage-df to re-run this coverage report
| File | Old Coverage | New Coverage | Delta |
|---|---|---|---|
| pkg/apis/pipeline/v1beta1/pipelinerun_types.go | 87.9% | 88.6% | 0.7 |
| pkg/reconciler/pipelinerun/pipelinerun.go | 87.1% | 86.9% | -0.2 |
| pkg/reconciler/pipelinerun/timeout.go | 84.1% | 84.8% | 0.7 |
The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report
| File | Old Coverage | New Coverage | Delta |
|---|---|---|---|
| pkg/apis/pipeline/v1beta1/pipelinerun_types.go | 87.9% | 88.6% | 0.7 |
| pkg/reconciler/pipelinerun/pipelinerun.go | 87.1% | 86.9% | -0.2 |
| pkg/reconciler/pipelinerun/timeout.go | 84.1% | 84.8% | 0.7 |
The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage-df to re-run this coverage report
| File | Old Coverage | New Coverage | Delta |
|---|---|---|---|
| pkg/apis/pipeline/v1beta1/pipelinerun_types.go | 87.9% | 88.6% | 0.7 |
| pkg/reconciler/pipelinerun/pipelinerun.go | 87.1% | 86.9% | -0.2 |
| pkg/reconciler/pipelinerun/timeout.go | 84.1% | 84.8% | 0.7 |
While implementing it, I was considering maybe we could always do a two-step timing out, this way we could avoid arbitrary thresholds. The first reconciliation would check
pr.HasTimedOut()and mark the status and returncontroller.NewRequeueImmediately(). This would trigger a UpdateStatus with only the condition change, then in the second reconciliation, we could try to update the rest of the status (thechildReferences, for example).
It could work, but I have two concerns:
- orchestrating updates across reconciles can be tricky - if there is no external input it's generally best to try and do everything in one reconcile cycle
- but I think it would be confusing for clients. If a client reads that the pipelinerun has completed, how does it know whether there is an update missing somewhere? Even with
NewRequeueImmediately, the controller queue can be quite busy and it may take some time before the new reconcile happens