`timeouts` is not working as expected.

## Expected Behavior

When timeouts are provided for the pipeline and tasks in the `timeouts` section of a PipelineRun, both the PipelineRun and TaskRun timeouts should be updated.

## Actual Behavior

When timeouts are provided for the pipeline and tasks in the `timeouts` section of a PipelineRun, only the PipelineRun timeout is updated; the TaskRun gets created with the default timeout.

## Steps to Reproduce the Problem

- Create a task and a pipeline
- Create a PipelineRun with the `timeouts` section defined

## Additional Info
- Kubernetes version:

  Output of `kubectl version`:

  ```
  Client Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.4", GitCommit:"872a965c6c6526caa949f0c6ac028ef7aff3fb78", GitTreeState:"archive", BuildDate:"2022-11-11T00:00:00Z", GoVersion:"go1.19.3", Compiler:"gc", Platform:"linux/amd64"}
  Kustomize Version: v4.5.7
  Server Version: version.Info{Major:"1", Minor:"24", GitVersion:"v1.24.6+263df15", GitCommit:"eddac29feb4bb46b99fb570999324e582d761a66", GitTreeState:"clean", BuildDate:"2023-01-23T21:00:20Z", GoVersion:"go1.18.7", Compiler:"gc", Platform:"linux/amd64"}
  ```
- Tekton Pipeline version:

  Output of `tkn version` or `kubectl get pods -n tekton-pipelines -l app=tekton-pipelines-controller -o=jsonpath='{.items[0].metadata.labels.version}'`:

  ```
  Client version: 0.28.0
  Pipeline version: v0.41.0
  Triggers version: v0.22.0
  Operator version: v0.63.0
  ```
```console
[varadhya@fedora plumbing]$ oc get tektonconfig config -o yaml | grep timeout
    default-timeout-minutes: 5
[varadhya@fedora plumbing]$ cat pr.yaml
apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  generateName: test-timeout-run-
spec:
  timeouts:
    pipeline: 2h
    tasks: 1h
  pipelineRef:
    name: test-timeout
[varadhya@fedora plumbing]$ oc create -f pr.yaml
pipelinerun.tekton.dev/test-timeout-run-n64qf created
[varadhya@fedora plumbing]$ tkn pr describe -L
Name:              test-timeout-run-n64qf
Namespace:         test-timeout
Pipeline Ref:      test-timeout
Service Account:   pipeline
Labels:
 tekton.dev/pipeline=test-timeout

🌡️  Status

STARTED         DURATION   STATUS
6 seconds ago   ---        Running

⏱  Timeouts

 Pipeline:   2h0m0s
 Tasks:      1h0m0s

🗂  Taskruns

 NAME                                    TASK NAME      STARTED         DURATION   STATUS
 ∙ test-timeout-run-n64qf-example-task   example-task   6 seconds ago   ---        Running

[varadhya@fedora plumbing]$ tkn tr describe -L
Name:              test-timeout-run-n64qf-example-task
Namespace:         test-timeout
Task Ref:          example-task
Service Account:   pipeline
Timeout:           5m0s
Labels:
 app.kubernetes.io/managed-by=tekton-pipelines
 tekton.dev/memberOf=tasks
 tekton.dev/pipeline=test-timeout
 tekton.dev/pipelineRun=test-timeout-run-n64qf
 tekton.dev/pipelineTask=example-task
 tekton.dev/task=example-task
Annotations:
 pipeline.tekton.dev/release=e38d112

🌡️  Status

STARTED          DURATION   STATUS
25 seconds ago   ---        Running

⚓ Params

 NAME        VALUE
 ∙ appName   123

🦶 Steps

 NAME          STATUS
 ∙ unnamed-0   Running
```
Thanks for filing @VeereshAradhya! The `tasks` timeout for a PipelineRun is for the cumulative time taken by all of its tasks. If you'd like to set a timeout for an individual task, you can use `pipeline.spec.tasks[].timeout`. I'm going to close out this issue, but please feel free to reopen if you have follow-up questions.
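For illustration, a per-task timeout set via `pipeline.spec.tasks[].timeout` might look like the sketch below (the pipeline and task names are made up, not taken from this thread):

```yaml
# Hypothetical example: a timeout on an individual PipelineTask via
# pipeline.spec.tasks[].timeout (names invented for illustration).
apiVersion: tekton.dev/v1beta1
kind: Pipeline
metadata:
  name: example-pipeline
spec:
  tasks:
    - name: build
      taskRef:
        name: build-task
      timeout: "0h30m0s"   # the TaskRun for this task gets a 30m timeout
```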
The pipeline that I used had only one task. In the below example, I have specified the tasks timeout as 2 hours (cumulative), so the taskrun should have a timeout of 2 hours, right?
```console
cat pr.yaml
apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  generateName: timeout-pipeline-run-
spec:
  timeouts:
    pipeline: 3h
    tasks: 2h
  pipelineRef:
    name: timeout-pipeline
status: {}
[varadhya@192 plumbing]$ tkn pr describe -L
Name:              timeout-pipeline-run-hml2q
Namespace:         test-timeout
Pipeline Ref:      timeout-pipeline
Service Account:   pipeline
Labels:
 tekton.dev/pipeline=timeout-pipeline

🌡️  Status

STARTED          DURATION   STATUS
20 seconds ago   ---        Running

⏱  Timeouts

 Pipeline:   3h0m0s
 Tasks:      2h0m0s

🗂  Taskruns

 NAME                                        TASK NAME      STARTED          DURATION   STATUS
 ∙ timeout-pipeline-run-hml2q-example-task   example-task   20 seconds ago   ---        Running

[varadhya@192 plumbing]$ tkn tr describe -L
Name:              timeout-pipeline-run-hml2q-example-task
Namespace:         test-timeout
Task Ref:          example-task
Service Account:   pipeline
Timeout:           1h0m0s
Labels:
 app.kubernetes.io/managed-by=tekton-pipelines
 tekton.dev/memberOf=tasks
 tekton.dev/pipeline=timeout-pipeline
 tekton.dev/pipelineRun=timeout-pipeline-run-hml2q
 tekton.dev/pipelineTask=example-task
 tekton.dev/task=example-task
Annotations:
 pipeline.tekton.dev/release=e38d112

🌡️  Status

STARTED          DURATION   STATUS
26 seconds ago   ---        Running

⚓ Params

 NAME        VALUE
 ∙ appName   123

🦶 Steps

 NAME          STATUS
 ∙ unnamed-0   Running
```
I tested the same thing on 0.37.5 and the behaviour is different: the taskrun timeout is 2h (1h59m59.990998163s).
```console
$ cat pr.yaml
apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  generateName: test-timeout-run-
spec:
  timeouts:
    pipeline: 3h
    tasks: 2h
  pipelineRef:
    name: test-timeout
[varadhya@192 plumbing]$ tkn pr describe -L
Name:              test-timeout-run-2x9bd
Namespace:         test-timeout
Pipeline Ref:      test-timeout
Service Account:   pipeline
Labels:
 tekton.dev/pipeline=test-timeout

🌡️  Status

STARTED          DURATION   STATUS
21 seconds ago   ---        Running

⏱  Timeouts

 Pipeline:   3h0m0s
 Tasks:      2h0m0s

🗂  Taskruns

 NAME                                      TASK NAME      STARTED          DURATION   STATUS
 ∙ test-timeout-run-2x9bd-example-task     example-task   21 seconds ago   ---        Running

[varadhya@192 plumbing]$ tkn tr describe -L
Name:              test-timeout-run-2x9bd-example-task
Namespace:         test-timeout
Task Ref:          example-task
Service Account:   pipeline
Timeout:           1h59m59.990998163s
Labels:
 app.kubernetes.io/managed-by=tekton-pipelines
 tekton.dev/memberOf=tasks
 tekton.dev/pipeline=test-timeout
 tekton.dev/pipelineRun=test-timeout-run-2x9bd
 tekton.dev/pipelineTask=example-task
 tekton.dev/task=example-task
Annotations:
 pipeline.tekton.dev/release=1759eab

🌡️  Status

STARTED          DURATION   STATUS
29 seconds ago   ---        Running

🦶 Steps

 NAME          STATUS
 ∙ unnamed-0   Running

[varadhya@192 plumbing]$ tkn version
Client version: 0.28.0
Pipeline version: v0.37.5
Triggers version: v0.20.2
Operator version: v0.60.1
```
/reopen
@VeereshAradhya: Reopened this issue.
Yes, we changed the behavior of timeouts between the versions you've listed here. We previously applied the time remaining from `timeouts.tasks` as the timeout to each taskrun. Now, the PipelineRun controller will cancel any running tasks after `timeouts.tasks` has elapsed. (This helped fix a race condition, plus issues with retried taskruns.) The `tasks` timeout isn't applied to the individual taskrun timeout (even if there's only one), but the taskruns will still be canceled after the `tasks` timeout has elapsed.
Oh, this is actually confusing now.
When I naively look at `timeouts.pipeline` and `timeouts.tasks`, I really think the one on `tasks` applies to *each* task. I asked around a bit (friends, customers, …) and they all thought the same. In addition, what is the difference between `timeouts.tasks` and `timeouts.pipeline` now? If `timeouts.tasks` is "for the cumulative time taken by all of its tasks", and a pipeline is made out of Tasks, what is the difference with `timeouts.pipeline` (which is for the whole pipeline, i.e. the "cumulation" of Tasks)?
> Yes, we changed the behavior of timeouts between the versions you've listed here.
Was there any deprecation period or feature flag here (like when we changed the `when` behavior)? It is breaking some of our workloads (and possibly our customers' as well) because of that behavior change.
`timeouts.tasks` is for non-finally tasks; `timeouts.pipeline` includes both regular tasks and finally tasks.
https://tekton.dev/docs/pipelines/pipelineruns/#configuring-a-failure-timeout
The documentation could probably make it clearer that `tasks` is the cumulative timeout for all non-finally tasks, not each individual task.
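As a sketch of how the three fields relate (the durations and pipeline name below are invented for illustration, and the stated constraint is my reading of the linked docs):

```yaml
# Hedged sketch: the three timeouts fields on a PipelineRun.
apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  generateName: timeouts-demo-
spec:
  pipelineRef:
    name: some-pipeline   # hypothetical pipeline name
  timeouts:
    pipeline: 3h   # the whole run: non-finally tasks plus finally tasks
    tasks: 2h      # cumulative budget for all non-finally tasks
    finally: 1h    # cumulative budget for finally tasks
    # When all three are set, tasks + finally must not exceed pipeline.
```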
@vdemeester can you give more detail on what workloads were breaking due to this change? It was introduced in https://github.com/tektoncd/pipeline/pull/5134
@lbernick On our downstream pipelines, we used to give tasks a timeout (using `timeouts.tasks`) that was different from the default TaskRun timeout. With the upgrade, the timeout is now the default (which is lower than the one we used to get with `timeouts.tasks`), and thus our pipelines fail because of this. It is definitely fixable in the Pipeline/PipelineRun definitions; it's just that it did break some existing behaviour on upgrade.
Not sure I understand-- I would expect the pipelinerun controller to time out the tasks after timeouts.tasks elapses, even if that's shorter than the default taskrun timeout. Is there an example PipelineRun that passed prior to these changes that fails now?
> Not sure I understand-- I would expect the pipelinerun controller to time out the tasks after `timeouts.tasks` elapses, even if that's shorter than the default taskrun timeout. Is there an example PipelineRun that passed prior to these changes that fails now?
This is what @VeereshAradhya wrote above, to be fair. With `timeouts.tasks = 2h`, 0.37.x would set the TaskRun timeout to 2h; now it sets it to 1h (because it uses the default timeout for all tasks). This breaks because we relied on `timeouts.tasks` to set that 2h timeout for each TaskRun (well, here it's only one, but still). Now it changed 🙃
Ah I see, sorry I didn't quite understand this before! I doubt we'd want to revert these changes; is there anything you'd suggest doing to mitigate the impact? We could definitely edit the release notes to make the impact more clear.
> Ah I see, sorry I didn't quite understand this before! I doubt we'd want to revert these changes; is there anything you'd suggest doing to mitigate the impact?
Oh yeah, not really asking to revert the change (yet? 😈). It's more "for next time, we need to be more careful". The naming is still a bit confusing to me (`timeouts.alltasks` or something would have made more sense), but I also don't think that's a huge deal.
> We could definitely edit the release notes to make the impact more clear.
👍🏼
@VeereshAradhya PTAL at https://github.com/tektoncd/pipeline/pull/6171 and let me know if these updated docs are clearer!
Also, release notes have been updated.