pipeline icon indicating copy to clipboard operation
pipeline copied to clipboard

Fix for PipelineRuns getting stuck in the running state in the cluster

Open RafaeLeal opened this issue 3 years ago • 26 comments

Changes

Issue: https://github.com/tektoncd/pipeline/issues/6076 This PR tries to fix scenarios where in the same reconciliation cycle the PipelineRun status gets too big to be updated and the status is changed to timeout. This means that the UpdateStatus fails, so the PipelineRun can't get out from the running status, even after the timeout. The PR introduces an arbitrary threshold (2*timeout), and if the reconciliation notice that threshold is reached, it skips any other update, and updates only the status, making the request as small as possible, avoiding etcd request size errors.

Submitter Checklist

As the author of this PR, please check off the items in this checklist:

  • [x] Has Docs included if any changes are user facing
  • [x] Has Tests included if any functionality added or changed
  • [x] Follows the commit message standard
  • [x] Meets the Tekton contributor standards (including functionality, content, code)
  • [x] Has a kind label. You can add one by adding a comment on this PR that contains /kind <type>. Valid types are bug, cleanup, design, documentation, feature, flake, misc, question, tep
  • [x] Release notes block below has been updated with any user facing changes (API changes, bug fixes, changes requiring upgrade notices or deprecation warnings)
  • [x] Release notes contains the string "action required" if the change requires additional action from users switching to the new release

Release Notes

Fix a bug that made big PipelineRuns get stuck in the running state in the cluster

RafaeLeal avatar Feb 01 '23 03:02 RafaeLeal

CLA Signed

The committers listed above are authorized under a signed CLA.

  • :white_check_mark: login: RafaeLeal / name: Rafael Leal (bcfc8b2d391020ccb1e4a017235d12aa7bfc81b7)

The following is the coverage report on the affected files. Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/apis/pipeline/v1beta1/pipelinerun_types.go 88.3% 83.5% -4.9
pkg/reconciler/pipelinerun/pipelinerun.go 86.6% 86.5% -0.0
pkg/reconciler/pipelinerun/timeout.go 84.1% 84.8% 0.7

tekton-robot avatar Feb 01 '23 04:02 tekton-robot

The following is the coverage report on the affected files. Say /test pull-tekton-pipeline-go-coverage-df to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/apis/pipeline/v1beta1/pipelinerun_types.go 88.3% 83.5% -4.9
pkg/reconciler/pipelinerun/pipelinerun.go 86.6% 86.5% -0.0
pkg/reconciler/pipelinerun/timeout.go 84.1% 84.8% 0.7

tekton-robot avatar Feb 01 '23 04:02 tekton-robot

The following is the coverage report on the affected files. Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/apis/pipeline/v1beta1/pipelinerun_types.go 88.3% 83.5% -4.9
pkg/reconciler/pipelinerun/pipelinerun.go 86.6% 86.5% -0.0
pkg/reconciler/pipelinerun/timeout.go 84.1% 84.8% 0.7

tekton-robot avatar Feb 01 '23 20:02 tekton-robot

The following is the coverage report on the affected files. Say /test pull-tekton-pipeline-go-coverage-df to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/apis/pipeline/v1beta1/pipelinerun_types.go 88.3% 83.5% -4.9
pkg/reconciler/pipelinerun/pipelinerun.go 86.6% 86.5% -0.0
pkg/reconciler/pipelinerun/timeout.go 84.1% 84.8% 0.7

tekton-robot avatar Feb 01 '23 20:02 tekton-robot

The following is the coverage report on the affected files. Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/apis/pipeline/v1beta1/pipelinerun_types.go 88.3% 83.5% -4.9
pkg/reconciler/pipelinerun/pipelinerun.go 86.6% 86.5% -0.0
pkg/reconciler/pipelinerun/timeout.go 84.1% 84.8% 0.7

tekton-robot avatar Feb 02 '23 17:02 tekton-robot

The following is the coverage report on the affected files. Say /test pull-tekton-pipeline-go-coverage-df to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/apis/pipeline/v1beta1/pipelinerun_types.go 88.3% 83.5% -4.9
pkg/reconciler/pipelinerun/pipelinerun.go 86.6% 86.5% -0.0
pkg/reconciler/pipelinerun/timeout.go 84.1% 84.8% 0.7

tekton-robot avatar Feb 02 '23 17:02 tekton-robot

The following is the coverage report on the affected files. Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/apis/pipeline/v1beta1/pipelinerun_types.go 88.3% 83.5% -4.9
pkg/reconciler/pipelinerun/pipelinerun.go 86.6% 86.5% -0.0
pkg/reconciler/pipelinerun/timeout.go 84.1% 84.8% 0.7

tekton-robot avatar Feb 03 '23 13:02 tekton-robot

The following is the coverage report on the affected files. Say /test pull-tekton-pipeline-go-coverage-df to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/apis/pipeline/v1beta1/pipelinerun_types.go 88.3% 83.5% -4.9
pkg/reconciler/pipelinerun/pipelinerun.go 86.6% 86.5% -0.0
pkg/reconciler/pipelinerun/timeout.go 84.1% 84.8% 0.7

tekton-robot avatar Feb 03 '23 13:02 tekton-robot

/kind bug

RafaeLeal avatar Feb 03 '23 14:02 RafaeLeal

The following is the coverage report on the affected files. Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/apis/pipeline/v1beta1/pipelinerun_types.go 88.3% 83.5% -4.9
pkg/reconciler/pipelinerun/pipelinerun.go 87.2% 87.2% -0.1
pkg/reconciler/pipelinerun/timeout.go 84.1% 84.8% 0.7

tekton-robot avatar Feb 06 '23 12:02 tekton-robot

The following is the coverage report on the affected files. Say /test pull-tekton-pipeline-go-coverage-df to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/apis/pipeline/v1beta1/pipelinerun_types.go 88.3% 83.5% -4.9
pkg/reconciler/pipelinerun/pipelinerun.go 87.2% 87.2% -0.1
pkg/reconciler/pipelinerun/timeout.go 84.1% 84.8% 0.7

tekton-robot avatar Feb 06 '23 12:02 tekton-robot

Thanks, @RafaeLeal for this PR!

Could you provide a bit more context about the issue in the commit message / PR description, and fill in the PR checklist from the template? Also, tide does not automatically squash commits before merging, so I would ask you to please squash them at least once the PR has been approved.

Thank you!

afrittoli avatar Feb 06 '23 14:02 afrittoli

The following is the coverage report on the affected files. Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/apis/pipeline/v1beta1/pipelinerun_types.go 88.3% 83.5% -4.9
pkg/reconciler/pipelinerun/pipelinerun.go 87.2% 87.0% -0.2
pkg/reconciler/pipelinerun/timeout.go 84.1% 84.8% 0.7

tekton-robot avatar Feb 06 '23 22:02 tekton-robot

The following is the coverage report on the affected files. Say /test pull-tekton-pipeline-go-coverage-df to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/apis/pipeline/v1beta1/pipelinerun_types.go 88.3% 83.5% -4.9
pkg/reconciler/pipelinerun/pipelinerun.go 87.2% 87.0% -0.2
pkg/reconciler/pipelinerun/timeout.go 84.1% 84.8% 0.7

tekton-robot avatar Feb 06 '23 22:02 tekton-robot

The following is the coverage report on the affected files. Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/apis/pipeline/v1beta1/pipelinerun_types.go 88.3% 89.0% 0.6
pkg/reconciler/pipelinerun/pipelinerun.go 87.2% 87.0% -0.2
pkg/reconciler/pipelinerun/timeout.go 84.1% 84.8% 0.7

tekton-robot avatar Feb 06 '23 23:02 tekton-robot

Could you provide a bit more context about the issue in the commit message / PR description, and fill in the PR checklist from the template? Also, tide does not automatically squash commits before merging, so I would ask you to please squash them at least once the PR has been approved.

Sure, I was not done yet actually. I added another test on HasTimedOutForALongTime func. I had some ideas on how to avoid arbitrary thresholds, that I shared in the issue, which I quote here.

While implementing it, I was considering maybe we could always do a two-step timing out, this way we could avoid arbitrary thresholds. The first reconciliation would check pr.HasTimedOut() and mark the status and return controller.NewRequeueImmediately(). This would trigger a UpdateStatus with only the condition change, then in the second reconciliation, we could try to update the rest of the status (the childReferences, for example).

This could work, I think. But the problem I had is that we still depend a lot on the order of the execution to make everything work properly. In the second reconciliation, we already have the timed-out condition, which means the pr.IsDone() is true and this changes the whole reconciliation process. https://github.com/tektoncd/pipeline/blob/main/pkg/reconciler/pipelinerun/pipelinerun.go#L205-L225

Looking at this, does it make sense to "fail fast" the reconciliation process? For example, even inside in this pr.IsDone() branch, it runs several cleanup processes. If one fails, should we really prevent the next one from executing? 🤔

RafaeLeal avatar Feb 06 '23 23:02 RafaeLeal

The following is the coverage report on the affected files. Say /test pull-tekton-pipeline-go-coverage-df to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/apis/pipeline/v1beta1/pipelinerun_types.go 88.3% 89.0% 0.6
pkg/reconciler/pipelinerun/pipelinerun.go 87.2% 87.0% -0.2
pkg/reconciler/pipelinerun/timeout.go 84.1% 84.8% 0.7

tekton-robot avatar Feb 06 '23 23:02 tekton-robot

The following is the coverage report on the affected files. Say /test pull-tekton-pipeline-go-coverage-df to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/apis/pipeline/v1beta1/pipelinerun_types.go 88.3% 89.0% 0.6
pkg/reconciler/pipelinerun/pipelinerun.go 87.2% 87.0% -0.2
pkg/reconciler/pipelinerun/timeout.go 84.1% 84.8% 0.7

tekton-robot avatar Feb 06 '23 23:02 tekton-robot

The following is the coverage report on the affected files. Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/apis/pipeline/v1beta1/pipelinerun_types.go 88.3% 89.0% 0.6
pkg/reconciler/pipelinerun/pipelinerun.go 87.2% 87.0% -0.2
pkg/reconciler/pipelinerun/timeout.go 84.1% 84.8% 0.7

tekton-robot avatar Feb 06 '23 23:02 tekton-robot

The following is the coverage report on the affected files. Say /test pull-tekton-pipeline-go-coverage-df to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/apis/pipeline/v1beta1/pipelinerun_types.go 88.3% 89.0% 0.6
pkg/reconciler/pipelinerun/pipelinerun.go 87.2% 87.0% -0.2
pkg/reconciler/pipelinerun/timeout.go 84.1% 84.8% 0.7

tekton-robot avatar Feb 06 '23 23:02 tekton-robot

/retest

RafaeLeal avatar Feb 07 '23 12:02 RafaeLeal

The following is the coverage report on the affected files. Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/apis/pipeline/v1beta1/pipelinerun_types.go 88.3% 89.0% 0.6
pkg/reconciler/pipelinerun/pipelinerun.go 87.2% 87.0% -0.2
pkg/reconciler/pipelinerun/timeout.go 84.1% 84.8% 0.7

tekton-robot avatar Feb 08 '23 13:02 tekton-robot

The following is the coverage report on the affected files. Say /test pull-tekton-pipeline-go-coverage-df to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/apis/pipeline/v1beta1/pipelinerun_types.go 88.3% 89.0% 0.6
pkg/reconciler/pipelinerun/pipelinerun.go 87.2% 87.0% -0.2
pkg/reconciler/pipelinerun/timeout.go 84.1% 84.8% 0.7

tekton-robot avatar Feb 08 '23 13:02 tekton-robot

The following is the coverage report on the affected files. Say /test pull-tekton-pipeline-go-coverage-df to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/apis/pipeline/v1beta1/pipelinerun_types.go 87.9% 88.6% 0.7
pkg/reconciler/pipelinerun/pipelinerun.go 87.1% 86.9% -0.2
pkg/reconciler/pipelinerun/timeout.go 84.1% 84.8% 0.7

tekton-robot avatar Feb 16 '23 22:02 tekton-robot

The following is the coverage report on the affected files. Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/apis/pipeline/v1beta1/pipelinerun_types.go 87.9% 88.6% 0.7
pkg/reconciler/pipelinerun/pipelinerun.go 87.1% 86.9% -0.2
pkg/reconciler/pipelinerun/timeout.go 84.1% 84.8% 0.7

tekton-robot avatar Feb 16 '23 22:02 tekton-robot

The following is the coverage report on the affected files. Say /test pull-tekton-pipeline-go-coverage-df to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/apis/pipeline/v1beta1/pipelinerun_types.go 87.9% 88.6% 0.7
pkg/reconciler/pipelinerun/pipelinerun.go 87.1% 86.9% -0.2
pkg/reconciler/pipelinerun/timeout.go 84.1% 84.8% 0.7

tekton-robot avatar Feb 16 '23 22:02 tekton-robot

The following is the coverage report on the affected files. Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/apis/pipeline/v1beta1/pipelinerun_types.go 87.9% 88.6% 0.7
pkg/reconciler/pipelinerun/pipelinerun.go 87.1% 86.9% -0.2
pkg/reconciler/pipelinerun/timeout.go 84.1% 84.8% 0.7

tekton-robot avatar Feb 17 '23 13:02 tekton-robot

The following is the coverage report on the affected files. Say /test pull-tekton-pipeline-go-coverage-df to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/apis/pipeline/v1beta1/pipelinerun_types.go 87.9% 88.6% 0.7
pkg/reconciler/pipelinerun/pipelinerun.go 87.1% 86.9% -0.2
pkg/reconciler/pipelinerun/timeout.go 84.1% 84.8% 0.7

tekton-robot avatar Feb 17 '23 13:02 tekton-robot

While implementing it, I was considering maybe we could always do a two-step timing out, this way we could avoid arbitrary thresholds. The first reconciliation would check pr.HasTimedOut() and mark the status and return controller.NewRequeueImmediately(). This would trigger a UpdateStatus with only the condition change, then in the second reconciliation, we could try to update the rest of the status (the childReferences, for example).

It could work, but I have two concerns:

  • orchestrating updates across reconciles can be tricky - if there is no external input it's generally best to try and do everything in one reconcile cycle
  • but I think it would be confusing for clients. If a client reads that the pipelinerun has completed, how does it know whether there is an update missing somewhere? Even with NewRequeueImmediately, the controller queue can be quite busy and it may take some time before the new reconcile happens

afrittoli avatar Feb 17 '23 14:02 afrittoli