pipeline icon indicating copy to clipboard operation
pipeline copied to clipboard

`timeouts` is not working as expected.

Open VeereshAradhya opened this issue 2 years ago • 14 comments

Expected Behavior

When provided timeouts for pipeline and task in timeout section of pipelinerun, both pipelinerun and taskrun timeout should get updated

Actual Behavior

When provided timeouts for pipeline and task in timeouts section of pipelinerun, only the pipelinerun timeout is updated and the taskrun gets created with default timeout

Steps to Reproduce the Problem

  1. Create a task and pipeline
  2. Create pipelierun with timeouts section defined

Additional Info

  • Kubernetes version:

    Output of kubectl version:

 Client Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.4", GitCommit:"872a965c6c6526caa949f0c6ac028ef7aff3fb78", GitTreeState:"archive", BuildDate:"2022-11-11T00:00:00Z", GoVersion:"go1.19.3", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.7
 Server Version: version.Info{Major:"1", Minor:"24", GitVersion:"v1.24.6+263df15", GitCommit:"eddac29feb4bb46b99fb570999324e582d761a66", GitTreeState:"clean", BuildDate:"2023-01-23T21:00:20Z", GoVersion:"go1.18.7", Compiler:"gc", Platform:"linux/amd64"}

  • Tekton Pipeline version:

    Output of tkn version or kubectl get pods -n tekton-pipelines -l app=tekton-pipelines-controller -o=jsonpath='{.items[0].metadata.labels.version}'

Client version: 0.28.0
Pipeline version: v0.41.0
Triggers version: v0.22.0
Operator version: v0.63.0
[varadhya@fedora plumbing]$ oc get tektonconfig config -o yaml | grep timeout 
    default-timeout-minutes: 5
[varadhya@fedora plumbing]$ cat pr.yaml
apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  generateName: test-timeout-run-
spec:
  timeouts:
    pipeline: 2h
    tasks: 1h
  pipelineRef:
    name: test-timeout
[varadhya@fedora plumbing]$ oc create -f pr.yaml
pipelinerun.tekton.dev/test-timeout-run-n64qf created
[varadhya@fedora plumbing]$ tkn pr describe -L
Name:              test-timeout-run-n64qf
Namespace:         test-timeout
Pipeline Ref:      test-timeout
Service Account:   pipeline
Labels:
 tekton.dev/pipeline=test-timeout

🌡️  Status

STARTED         DURATION   STATUS
6 seconds ago   ---        Running

⏱  Timeouts
 Pipeline:   2h0m0s
 Tasks:      1h0m0s

🗂  Taskruns

 NAME                                    TASK NAME      STARTED         DURATION   STATUS
 ∙ test-timeout-run-n64qf-example-task   example-task   6 seconds ago   ---        Running
[varadhya@fedora plumbing]$ tkn tr describe -L
Name:              test-timeout-run-n64qf-example-task
Namespace:         test-timeout
Task Ref:          example-task
Service Account:   pipeline
Timeout:           5m0s
Labels:
 app.kubernetes.io/managed-by=tekton-pipelines
 tekton.dev/memberOf=tasks
 tekton.dev/pipeline=test-timeout
 tekton.dev/pipelineRun=test-timeout-run-n64qf
 tekton.dev/pipelineTask=example-task
 tekton.dev/task=example-task
Annotations:
 pipeline.tekton.dev/release=e38d112

🌡️  Status

STARTED          DURATION    STATUS
25 seconds ago   ---         Running

⚓ Params

 NAME        VALUE
 ∙ appName   123

🦶 Steps

 NAME          STATUS
 ∙ unnamed-0   Running

VeereshAradhya avatar Feb 09 '23 15:02 VeereshAradhya

thanks for filing @VeereshAradhya! The "tasks" timeout for a PipelineRun is for the cumulative time taken by all of its tasks. If you'd like to set a timeout for an individual task, you can use pipeline.spec.tasks[].timeout. I'm going to close out this issue but please feel free to reopen if you have follow-up questions.

lbernick avatar Feb 09 '23 15:02 lbernick

The pipeline that I used had only one task. In the below example, I have specified task timeout as 2hours (cumulative) so the taskrun should have timeout of 2hours right?

cat pr.yaml
apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  generateName: timeout-pipeline-run-
spec:
  timeouts:
    pipeline: 3h
    tasks: 2h
  pipelineRef:
    name: timeout-pipeline
status: {}
[varadhya@192 plumbing]$ tkn pr describe -L
Name:              timeout-pipeline-run-hml2q
Namespace:         test-timeout
Pipeline Ref:      timeout-pipeline
Service Account:   pipeline
Labels:
 tekton.dev/pipeline=timeout-pipeline

🌡️  Status

STARTED          DURATION   STATUS
20 seconds ago   ---        Running

⏱  Timeouts
 Pipeline:   3h0m0s
 Tasks:      2h0m0s

🗂  Taskruns

 NAME                                        TASK NAME      STARTED          DURATION   STATUS
 ∙ timeout-pipeline-run-hml2q-example-task   example-task   20 seconds ago   ---        Running
[varadhya@192 plumbing]$ tkn tr describe -L
Name:              timeout-pipeline-run-hml2q-example-task
Namespace:         test-timeout
Task Ref:          example-task
Service Account:   pipeline
Timeout:           1h0m0s
Labels:
 app.kubernetes.io/managed-by=tekton-pipelines
 tekton.dev/memberOf=tasks
 tekton.dev/pipeline=timeout-pipeline
 tekton.dev/pipelineRun=timeout-pipeline-run-hml2q
 tekton.dev/pipelineTask=example-task
 tekton.dev/task=example-task
Annotations:
 pipeline.tekton.dev/release=e38d112

🌡️  Status

STARTED          DURATION    STATUS
26 seconds ago   ---         Running

⚓ Params

 NAME        VALUE
 ∙ appName   123

🦶 Steps

 NAME          STATUS
 ∙ unnamed-0   Running

VeereshAradhya avatar Feb 09 '23 16:02 VeereshAradhya

I tested the same thing on 0.37.5 and the behaviour is different. The taskrun timeout is 2h (1h59m59.990998163s)

$ cat pr.yaml
apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  generateName: test-timeout-run-
spec:
  timeouts:
    pipeline: 3h
    tasks: 2h
  pipelineRef:
    name: test-timeout
[varadhya@192 plumbing]$ 
[varadhya@192 plumbing]$ tkn pr describe -L
Name:              test-timeout-run-2x9bd
Namespace:         test-timeout
Pipeline Ref:      test-timeout
Service Account:   pipeline
Labels:
 tekton.dev/pipeline=test-timeout

🌡️  Status

STARTED          DURATION   STATUS
21 seconds ago   ---        Running

⏱  Timeouts
 Pipeline:   3h0m0s
 Tasks:      2h0m0s

🗂  Taskruns

 NAME                                    TASK NAME      STARTED          DURATION   STATUS
 ∙ test-timeout-run-2x9bd-example-task   example-task   21 seconds ago   ---        Running
[varadhya@192 plumbing]$ 
[varadhya@192 plumbing]$ tkn tr describe -L
Name:              test-timeout-run-2x9bd-example-task
Namespace:         test-timeout
Task Ref:          example-task
Service Account:   pipeline
Timeout:           1h59m59.990998163s
Labels:
 app.kubernetes.io/managed-by=tekton-pipelines
 tekton.dev/memberOf=tasks
 tekton.dev/pipeline=test-timeout
 tekton.dev/pipelineRun=test-timeout-run-2x9bd
 tekton.dev/pipelineTask=example-task
 tekton.dev/task=example-task
Annotations:
 pipeline.tekton.dev/release=1759eab

🌡️  Status

STARTED          DURATION    STATUS
29 seconds ago   ---         Running

🦶 Steps

 NAME          STATUS
 ∙ unnamed-0   Running
[varadhya@192 plumbing]$ 
[varadhya@192 plumbing]$ 
[varadhya@192 plumbing]$ tkn version
Client version: 0.28.0
Pipeline version: v0.37.5
Triggers version: v0.20.2
Operator version: v0.60.1

VeereshAradhya avatar Feb 09 '23 16:02 VeereshAradhya

/reopen

VeereshAradhya avatar Feb 09 '23 16:02 VeereshAradhya

@VeereshAradhya: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

tekton-robot avatar Feb 09 '23 16:02 tekton-robot

Yes, we changed the behavior of timeouts between the versions you've listed here. We previously applied the time remaining from timeouts.tasks as the timeout to each taskrun. Now, the PipelineRun controller will cancel any running tasks after timeouts.tasks has elapsed. (This helped fix a race condition plus issues with retried taskruns.) The tasks timeout isn't applied to the individual taskrun timeout (even if there's only one) but the taskRuns will still be canceled after the tasks timeout has elapsed.

lbernick avatar Feb 09 '23 18:02 lbernick

Oh this is actually confusing now. When I naively look at timeouts.pipeline and timeouts.tasks, I really thing the one on tasks is for "each" tasks. I asked around a bit (friends, customers, …) and they all thought the same. In addition, what is the difference between timeouts.tasks and timeouts.pipelines now ? If timeout.tasks is "for the cumulative time taken by all of its tasks" and, a pipeline is made out of Tasks, what is the difference with timeouts.pipeline (which is for the whole pipeline, which is the "cumulation" of Tasks) ?

Yes, we changed the behavior of timeouts between the versions you've listed here.

Was there any deprecation or feature-flags here (like when we changed the when behavior) ? It is breaking some of our workloads (and possible our customers as well) because of that behavior change.

vdemeester avatar Feb 10 '23 09:02 vdemeester

timeouts.tasks is for non-finally tasks, timeouts.pipeline includes both regular tasks and finally tasks https://tekton.dev/docs/pipelines/pipelineruns/#configuring-a-failure-timeout

The documentation could probably make this clearer that tasks is the cumulative timeout for all non-finally tasks, not each individual task.

AlanGreene avatar Feb 10 '23 09:02 AlanGreene

@vdemeester can you give more detail on what workloads were breaking due to this change? It was introduced in https://github.com/tektoncd/pipeline/pull/5134

lbernick avatar Feb 10 '23 13:02 lbernick

@lbernick On our downstream pipelines, we used to have a given amount for tasks to timeout (using timeouts.tasks) that was different from the default TaskRun timeout. With the upgrade, now the timeout is the default (which is lower than the one we used to have with timeout.tasks), and thus we have our pipeline failed because of this. It is definitely fixable in the Pipeline/PipelineRun definitions, it's just that it did break some existing behaviour on upgrade.

vdemeester avatar Feb 10 '23 15:02 vdemeester

Not sure I understand-- I would expect the pipelinerun controller to time out the tasks after timeouts.tasks elapses, even if that's shorter than the default taskrun timeout. Is there an example PipelineRun that passed prior to these changes that fails now?

lbernick avatar Feb 10 '23 21:02 lbernick

Not sure I understand-- I would expect the pipelinerun controller to time out the tasks after timeouts.tasks elapses, even if that's shorter than the default taskrun timeout. Is there an example PipelineRun that passed prior to these changes that fails now?

This is what @VeereshAradhya wrote above to be fair. With timeouts.tasks = 2h, in 0.37.x it would set the TaskRun timeout to 2h, now it sets it to 1h (because it uses the default timeout for all tasks). This breaks because we relied on that timeouts.tasks to set that timeout for 2h for each TaskRun (well here it's only one but still). Now it changed 🙃

vdemeester avatar Feb 14 '23 16:02 vdemeester

Ah I see, sorry I didn't quite understand this before! I doubt we'd want to revert these changes; is there anything you'd suggest doing to mitigate the impact? We could definitely edit the release notes to make the impact more clear.

lbernick avatar Feb 14 '23 17:02 lbernick

Ah I see, sorry I didn't quite understand this before! I doubt we'd want to revert these changes; is there anything you'd suggest doing to mitigate the impact?

Oh yeah, no really asking to revert the change (yet ? 😈). It's more "for next time we need to be more careful". The "naming" is still a bit confusing to me (timeouts.alltasks or something would have made more sense to me) but, I also don't think that's a huge deal.

We could definitely edit the release notes to make the impact more clear.

👍🏼

vdemeester avatar Feb 14 '23 17:02 vdemeester

@VeereshAradhya PTAL at https://github.com/tektoncd/pipeline/pull/6171 and let me know if these updated docs are clearer!

lbernick avatar Feb 15 '23 14:02 lbernick

Also, release notes have been updated.

lbernick avatar Feb 15 '23 14:02 lbernick