pipeline icon indicating copy to clipboard operation
pipeline copied to clipboard

PipelineRun is canceled prior to its child TaskRuns being canceled.

Open chuangw6 opened this issue 1 year ago • 12 comments

Expected Behavior

If a PipelineRun is cancelled during execution, the PipelineRun should be marked as terminal status by pipelinerun reconciler after all its child TaskRuns have been cancelled successfully by taskrun reconciler. To users, this means that

  • PipelineRun completion time should be greater than its child TaskRuns completion time.
  • The final PipelineRun update should be emitted after all its child TaskRuns' final updates are emitted.

Actual Behavior

  • Final PipelineRun update is emitted before child taskruns are emitted.
  • PipelineRun completion time is smaller than child taskrun completion time.

Steps to Reproduce the Problem

  1. patch this commit that adds logs into controller
  2. Run this pipelinerun
apiVersion: tekton.dev/v1
kind: PipelineRun
metadata:
  name: tekton-pr
spec:
  pipelineSpec:
    tasks:
      - name: t1
        taskSpec:
          steps:
            - image: busybox
              script: |
                sleep 600
  1. Cancel the PipelineRun by applying this modified yaml.
apiVersion: tekton.dev/v1
kind: PipelineRun
metadata:
  name: tekton-pr
spec:
  status: "Cancelled"
  pipelineSpec:
    tasks:
      - name: t1
        taskSpec:
          steps:
            - image: busybox
              script: |
                sleep 600
  1. Check the pipeline controller log, you will see the log that shows that the terminal pipelinerun is emitted first and has smaller completion time (searching >>> emitting).
{"severity":"warn","timestamp":"2024-06-06T15:49:43.749Z","logger":"tekton-pipelines-controller","caller":"pipelinerun/pipelinerun.go:253","message":">>> emitting pipelinerun, name: tekton-pr, completion time: 2024-06-06 15:49:43.749423811 +0000 UTC m=+87.911781253, condition status: False, condition ltt: {2024-06-06 15:49:43.749422032 +0000 UTC m=+87.911779516}","commit":"9760223-dirty","knative.dev/controller":"github.com.tektoncd.pipeline.pkg.reconciler.pipelinerun.Reconciler","knative.dev/kind":"tekton.dev.PipelineRun","knative.dev/traceid":"5bb03973-86fb-42e4-b1f6-858ef01876fe","knative.dev/key":"default/tekton-pr"}
{"severity":"warn","timestamp":"2024-06-06T15:49:43.749Z","logger":"tekton-pipelines-controller","caller":"taskrun/taskrun.go:777","message":"stopping task run \"tekton-pr-t1\" because of \"TaskRunCancelled\"","commit":"9760223-dirty","knative.dev/controller":"github.com.tektoncd.pipeline.pkg.reconciler.taskrun.Reconciler","knative.dev/kind":"tekton.dev.TaskRun","knative.dev/traceid":"090e81e8-7394-4ddc-8d15-4551931c8abd","knative.dev/key":"default/tekton-pr-t1"}
{"severity":"info","timestamp":"2024-06-06T15:49:43.749Z","logger":"tekton-pipelines-controller.event-broadcaster","caller":"record/event.go:364","message":"Event(v1.ObjectReference{Kind:\"PipelineRun\", Namespace:\"default\", Name:\"tekton-pr\", UID:\"32b2ce33-3cb2-4502-9165-fcf4457e7375\", APIVersion:\"tekton.dev/v1\", ResourceVersion:\"169834134\", FieldPath:\"\"}): type: 'Warning' reason: 'Failed' PipelineRun \"tekton-pr\" was cancelled","commit":"9760223-dirty"}
{"severity":"warn","timestamp":"2024-06-06T15:49:43.768Z","logger":"tekton-pipelines-controller","caller":"taskrun/taskrun.go:166","message":">>> emitting taskrun, name: tekton-pr-t1, completion time: 2024-06-06 15:49:43.749754967 +0000 UTC m=+87.912112418, condition status: False, condition ltt: {2024-06-06 15:49:43.74975262 +0000 UTC m=+87.912110080}","commit":"9760223-dirty","knative.dev/controller":"github.com.tektoncd.pipeline.pkg.reconciler.taskrun.Reconciler","knative.dev/kind":"tekton.dev.TaskRun","knative.dev/traceid":"090e81e8-7394-4ddc-8d15-4551931c8abd","knative.dev/key":"default/tekton-pr-t1"}

Additional Info

  • Kubernetes version:

    Output of kubectl version:

Client Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.1", GitCommit:"86ec240af8cbd1b60bcc4c03c20da9b98005b92e", GitTreeState:"clean", BuildDate:"2021-12-16T11:41:01Z", GoVersion:"go1.17.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"28", GitVersion:"v1.28.9-gke.1000000", GitCommit:"21b78d6e66a8659125f752293620ab9b2ed9bd7b", GitTreeState:"clean", BuildDate:"2024-04-17T09:14:21Z", GoVersion:"go1.21.9 X:boringcrypto", Compiler:"gc", Platform:"linux/amd64"}
WARNING: version difference between client (1.23) and server (1.28) exceeds the supported minor version skew of +/-1
  • Tekton Pipeline version:

    Output of tkn version or kubectl get pods -n tekton-pipelines -l app=tekton-pipelines-controller -o=jsonpath='{.items[0].metadata.labels.version}'

latest main

chuangw6 avatar Jun 06 '24 16:06 chuangw6

It looks like Tekton intends to mark PR as canceled after all its child TRs are marked as canceled successfully (code source).

However, taskrun reconciler needs a bit time to reconcile to mark child TR as canceled and to emit the final update after pipelinerun reconciler patches the tr. Since pipelinerun reconciler cancels the PipelineRun and emits it immediately after patching the taskruns, the pr is emitted prior to tr being emitted and has earlier completion time.

chuangw6 avatar Jun 06 '24 16:06 chuangw6

Looked into the log a bit more. It looks like the parent pipelinerun was canceled twice and thus 2 terminal pipelinerun events were emitted (first line and last line in the following log). Both 2 events reported pipelinerun with terminal state condition status==false that means it's done, but the completion time and LastTransitionTime are changed in the second event, which is confusing.

{"severity":"warn","timestamp":"2024-06-07T00:02:46.271Z","logger":"tekton-pipelines-controller","caller":"pipelinerun/pipelinerun.go:253","message":">>>> emitting pipelinerun, name: tekton-pr-emit-ha, completion time: 2024-06-07 00:02:46.271534506 +0000 UTC m=+113.976747118, condition status: False, condition ltt: {2024-06-07 00:02:46.271533558 +0000 UTC m=+113.976746181}","commit":"edf7e90-dirty","knative.dev/controller":"github.com.tektoncd.pipeline.pkg.reconciler.pipelinerun.Reconciler","knative.dev/kind":"tekton.dev.PipelineRun","knative.dev/traceid":"3501b088-0278-43ac-b67c-e25e44298096","knative.dev/key":"default/tekton-pr-emit-ha"}
{"severity":"info","timestamp":"2024-06-07T00:02:46.271Z","logger":"tekton-pipelines-controller.event-broadcaster","caller":"record/event.go:364","message":"Event(v1.ObjectReference{Kind:\"PipelineRun\", Namespace:\"default\", Name:\"tekton-pr-emit-ha\", UID:\"e96bb31f-a06b-46f9-b3a1-a74a50360d04\", APIVersion:\"tekton.dev/v1\", ResourceVersion:\"170179062\", FieldPath:\"\"}): type: 'Warning' reason: 'Failed' PipelineRun \"tekton-pr-emit-ha\" was cancelled","commit":"edf7e90-dirty"}
{"severity":"warn","timestamp":"2024-06-07T00:02:46.286Z","logger":"tekton-pipelines-controller","caller":"taskrun/taskrun.go:166","message":">>>> emitting taskrun, name: tekton-pr-emit-ha-matrix-task-ha, completion time: 2024-06-07 00:02:46.271505564 +0000 UTC m=+113.976718177, condition status: False, condition ltt: {2024-06-07 00:02:46.271493449 +0000 UTC m=+113.976706071}","commit":"edf7e90-dirty","knative.dev/controller":"github.com.tektoncd.pipeline.pkg.reconciler.taskrun.Reconciler","knative.dev/kind":"tekton.dev.TaskRun","knative.dev/traceid":"e0a1fea5-640b-40ec-a647-984ee0248077","knative.dev/key":"default/tekton-pr-emit-ha-matrix-task-ha"}
{"severity":"info","timestamp":"2024-06-07T00:02:46.288Z","logger":"tekton-pipelines-controller.event-broadcaster","caller":"record/event.go:364","message":"Event(v1.ObjectReference{Kind:\"TaskRun\", Namespace:\"default\", Name:\"tekton-pr-emit-ha-matrix-task-ha\", UID:\"941e7a91-c8b3-400c-b29c-56f79237057f\", APIVersion:\"tekton.dev/v1\", ResourceVersion:\"170179063\", FieldPath:\"\"}): type: 'Warning' reason: 'Failed' TaskRun \"tekton-pr-emit-ha-matrix-task-ha\" was cancelled. TaskRun cancelled as the PipelineRun it belongs to has been cancelled.","commit":"edf7e90-dirty"}
{"severity":"info","timestamp":"2024-06-07T00:02:46.296Z","logger":"tekton-pipelines-controller","caller":"controller/controller.go:550","message":"Reconcile succeeded","commit":"edf7e90-dirty","knative.dev/controller":"github.com.tektoncd.pipeline.pkg.reconciler.pipelinerun.Reconciler","knative.dev/kind":"tekton.dev.PipelineRun","knative.dev/traceid":"3501b088-0278-43ac-b67c-e25e44298096","knative.dev/key":"default/tekton-pr-emit-ha","duration":0.047430652}
{"severity":"info","timestamp":"2024-06-07T00:02:46.296Z","logger":"tekton-pipelines-controller","caller":"pipelinerun/cancel.go:139","message":"cancelling TaskRun tekton-pr-emit-ha-matrix-task-ha","commit":"edf7e90-dirty","knative.dev/controller":"github.com.tektoncd.pipeline.pkg.reconciler.pipelinerun.Reconciler","knative.dev/kind":"tekton.dev.PipelineRun","knative.dev/traceid":"47592fa6-3a06-4274-b05e-58d33303fc21","knative.dev/key":"default/tekton-pr-emit-ha"}
{"severity":"info","timestamp":"2024-06-07T00:02:46.311Z","logger":"tekton-pipelines-controller","caller":"controller/controller.go:550","message":"Reconcile succeeded","commit":"edf7e90-dirty","knative.dev/controller":"github.com.tektoncd.pipeline.pkg.reconciler.taskrun.Reconciler","knative.dev/kind":"tekton.dev.TaskRun","knative.dev/traceid":"e0a1fea5-640b-40ec-a647-984ee0248077","knative.dev/key":"default/tekton-pr-emit-ha-matrix-task-ha","duration":0.040763412}
{"severity":"info","timestamp":"2024-06-07T00:02:46.312Z","logger":"tekton-pipelines-controller","caller":"taskrun/taskrun.go:149","message":"taskrun done : tekton-pr-emit-ha-matrix-task-ha \n","commit":"edf7e90-dirty","knative.dev/controller":"github.com.tektoncd.pipeline.pkg.reconciler.taskrun.Reconciler","knative.dev/kind":"tekton.dev.TaskRun","knative.dev/traceid":"a65f95f4-9b47-40f6-9b0f-93f325712bca","knative.dev/key":"default/tekton-pr-emit-ha-matrix-task-ha"}
{"severity":"info","timestamp":"2024-06-07T00:02:46.312Z","logger":"tekton-pipelines-controller","caller":"controller/controller.go:550","message":"Reconcile succeeded","commit":"edf7e90-dirty","knative.dev/controller":"github.com.tektoncd.pipeline.pkg.reconciler.taskrun.Reconciler","knative.dev/kind":"tekton.dev.TaskRun","knative.dev/traceid":"a65f95f4-9b47-40f6-9b0f-93f325712bca","knative.dev/key":"default/tekton-pr-emit-ha-matrix-task-ha","duration":0.000230135}
{"severity":"warn","timestamp":"2024-06-07T00:02:46.331Z","logger":"tekton-pipelines-controller","caller":"pipelinerun/pipelinerun.go:253","message":">>>> emitting pipelinerun, name: tekton-pr-emit-ha, completion time: 2024-06-07 00:02:46.331209788 +0000 UTC m=+114.036422401, condition status: False, condition ltt: {2024-06-07 00:02:46.331207784 +0000 UTC m=+114.036420413}","commit":"edf7e90-dirty","knative.dev/controller":"github.com.tektoncd.pipeline.pkg.reconciler.pipelinerun.Reconciler","knative.dev/kind":"tekton.dev.PipelineRun","knative.dev/traceid":"47592fa6-3a06-4274-b05e-58d33303fc21","knative.dev/key":"default/tekton-pr-emit-ha"}

chuangw6 avatar Jun 07 '24 00:06 chuangw6

Right now when we cancel the pipelineRun, we update the Condition of the PipelineRun Status and mark it as cancelled. At the same time, we submit patch for the Child Tasks. That's why it appears that the PipelineRun was cancelled before the child Tasks.

I think the correct approach should be for the pipelineRun controller to send the patches to the child Tasks to cancel and before updating it's own condition to cancelled, it should wait until all the Child Tasks have been cancelled and then cancel itself.

WDYT @tektoncd/core-maintainers ?

chitrangpatel avatar Jun 26 '24 12:06 chitrangpatel

/assign @renzodavid9

chitrangpatel avatar Jun 27 '24 14:06 chitrangpatel

@chitrangpatel: GitHub didn't allow me to assign the following users: renzodavid9.

Note that only tektoncd members, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time. For more information please see the contributor guide

In response to this:

/assign @renzodavid9

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

tekton-robot avatar Jun 27 '24 14:06 tekton-robot

/assign me

renzodavid9 avatar Jun 28 '24 14:06 renzodavid9

@renzodavid9: GitHub didn't allow me to assign the following users: me.

Note that only tektoncd members, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time. For more information please see the contributor guide

In response to this:

/assign me

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

tekton-robot avatar Jun 28 '24 14:06 tekton-robot

/assign @renzodavid9

renzodavid9 avatar Jun 28 '24 14:06 renzodavid9

Hello there,

To elaborate on @chitrangpatel comment, this change will make PipelineRun cancellation asynchronous. With this, we would need the PipelineRun to enter a "Cancelling" state when cancellation is triggered and remain there until all child TaskRuns are cancelled, at which point it will mark itself as "Cancelled". We currently don't have such a "Cancelling" state, so we'll need to add it and make the transition as explained before. It would look something like:

  1. When cancellation is triggered:
apiVersion: tekton.dev/v1
kind: PipelineRun
metadata:
  name: pipeline-execution
...
status:
  conditions:
  - lastTransitionTime: ...
    message: PipelineRun "pipeline-execution" is cancelling
    reason: Cancelling
    status: Unknown
    type: Succeeded
  1. After all the PipelineRun child tasks stop, it will go to "Cancelled":
apiVersion: tekton.dev/v1
kind: PipelineRun
metadata:
  name: pipeline-execution
...
status:
  conditions:
  - lastTransitionTime: "2024-07-22T20:13:42Z"
    message: PipelineRun "pipeline-execution" was cancelled
    reason: Cancelled
    status: "False"
    type: Succeeded

I would like to validate this idea here to see if anyone have any thoughts about it @tektoncd/core-maintainers

Thanks!

renzodavid9 avatar Jul 22 '24 20:07 renzodavid9

I think I like the idea, but I wonder how impactful it might be for existing clients (and the change / addition of a new "reason"). I would think it might not be super breaking as the "end state" doesn't change, there is just going to be an "unknown" state, but we need to asses this. Otherwise, it makes sense to me.

vdemeester avatar Jul 30 '24 13:07 vdemeester

Is it possible to add a condition in the annotation instead of adding the condition state Cancelling, similar to tekton.dev/canceling-state, so that it will not affect existing clients?

chengjoey avatar Aug 20 '24 07:08 chengjoey

@chengjoey I'ld rather go with the Condition state. This is not "backward" incompatible per se, it's just that client might not know this state, but all other state they know are still there, so it shouldn't break them.

vdemeester avatar Aug 26 '24 11:08 vdemeester