Task time exceeds `timeouts.tasks` when the task is retried
Expected Behavior
If `timeouts.tasks` is exceeded, the task should not be retried.
`finally` tasks should always be executed.
Actual Behavior
If the task is retried, the total task time exceeds `timeouts.tasks`.
In addition, if this causes the pipeline execution time to exceed `timeouts.pipeline`, the `finally` tasks are forcibly timed out.
Steps to Reproduce the Problem
apiVersion: tekton.dev/v1beta1
kind: Pipeline
metadata:
  name: please-say-bye
spec:
  tasks:
    - name: hi
      retries: 2
      taskSpec:
        steps:
          - name: hi
            image: alpine:3.12
            script: |
              echo 'hi'
              sleep 10
  finally:
    - name: bye
      taskSpec:
        steps:
          - name: bye
            image: alpine:3.12
            script: |
              echo 'bye'
---
apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  generateName: please-say-bye-
spec:
  timeouts:
    pipeline: 10s
    tasks: 5s
    finally: 5s
  pipelineRef:
    name: please-say-bye
❯ tkn -n pipelines tr ls --label tekton.dev/pipeline=please-say-bye
NAME                             STARTED          DURATION    STATUS
please-say-bye-tzjdt-bye-pft4x   25 seconds ago   1 second    Failed(TaskRunTimeout)
please-say-bye-tzjdt-hi-qqlgs    30 seconds ago   5 seconds   Failed(TaskRunTimeout)
❯ tkn -n pipelines pr desc please-say-bye-tzjdt
Name:              please-say-bye-tzjdt
Namespace:         pipelines
Pipeline Ref:      please-say-bye
Service Account:   default
Labels:
 tekton.dev/pipeline=please-say-bye

🌡️  Status

STARTED        DURATION     STATUS
1 minute ago   14 seconds   Failed

💌 Message

Tasks Completed: 2 (Failed: 2, Cancelled 0), Skipped: 0 (TaskRun "please-say-bye-tzjdt-bye-pft4x" failed to finish within "1s")

📦 Resources

 No resources

⚓ Params

 No params

📝 Results

 No results

📂 Workspaces

 No workspaces

🗂  Taskruns

 NAME                               TASK NAME   STARTED        DURATION    STATUS
 ✗ please-say-bye-tzjdt-bye-pft4x   bye         1 minute ago   1 second    Failed(TaskRunTimeout)
 ✗ please-say-bye-tzjdt-hi-qqlgs    hi          1 minute ago   5 seconds   Failed(TaskRunTimeout)

⏭️  Skipped Tasks

 No Skipped Tasks

Additional Info
- Kubernetes version:
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.1", GitCommit:"5e58841cce77d4bc13713ad2b91fa0d961e69192", GitTreeState:"clean", BuildDate:"2021-05-12T14:18:45Z", GoVersion:"go1.16.4", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"20+", GitVersion:"v1.20.4-eks-6b7464", GitCommit:"6b746440c04cb81db4426842b4ae65c3f7035e53", GitTreeState:"clean", BuildDate:"2021-03-19T19:33:03Z", GoVersion:"go1.15.8", Compiler:"gc", Platform:"linux/amd64"}
- Tekton Pipeline version:
Client version: 0.19.0
Pipeline version: v0.25.0
Triggers version: v0.14.0
Dashboard version: v0.17.0
@ornew thanks for the issue. cc @souleb @pritidesai @sbwsg
Hi @ornew, thanks for the issue. The task timeout is processed at runtime from the pipelinerun start time. If you retry only a task, then taking the pipelinerun start time into consideration, your task has already timed out. Today the coded logic is to give 1s to the task to run. I am actually not sure why we give that second...
@souleb Thanks for checking this issue :)
When a TaskRun retries, the `Timeout` is not recalculated but the `StartTime` is reset. It looks to me like the retried task can pass through `HasTimedOut`.
https://github.com/tektoncd/pipeline/blob/0e9d9e603347f85db7017180412ed25419307aa6/pkg/reconciler/pipelinerun/pipelinerun.go#L730
https://github.com/tektoncd/pipeline/blob/0e9d9e603347f85db7017180412ed25419307aa6/pkg/reconciler/pipelinerun/pipelinerun.go#L749
https://github.com/tektoncd/pipeline/blob/0e9d9e603347f85db7017180412ed25419307aa6/pkg/reconciler/pipelinerun/pipelinerun.go#L896
https://github.com/tektoncd/pipeline/blob/0e9d9e603347f85db7017180412ed25419307aa6/pkg/apis/pipeline/v1alpha1/taskrun_types.go#L231
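To illustrate the suspicion above, here is a minimal, self-contained sketch (hypothetical types, not the actual Tekton source) of a `HasTimedOut`-style check. Because a retry resets `StartTime` to "now" while `Timeout` stays unchanged, the retried attempt gets a fresh full window and the check cannot fire:

```go
package main

import (
	"fmt"
	"time"
)

// taskRunStatus is a hypothetical, stripped-down stand-in for the fields
// such a check looks at: a start time and a timeout.
type taskRunStatus struct {
	StartTime time.Time
	Timeout   time.Duration
}

// hasTimedOut mirrors the shape of a HasTimedOut-style check:
// the run counts as timed out only once now > StartTime + Timeout.
func (s taskRunStatus) hasTimedOut(now time.Time) bool {
	if s.StartTime.IsZero() || s.Timeout == 0 {
		return false // no status yet, or timeout disabled
	}
	return now.After(s.StartTime.Add(s.Timeout))
}

func main() {
	pipelineStart := time.Now().Add(-6 * time.Second) // pipelinerun started 6s ago
	taskTimeout := 5 * time.Second                    // timeouts.tasks = 5s

	// First attempt: started with the pipelinerun, so it is already past 5s.
	first := taskRunStatus{StartTime: pipelineStart, Timeout: taskTimeout}
	fmt.Println("first attempt timed out:", first.hasTimedOut(time.Now())) // true

	// Retry: StartTime is reset to "now" but Timeout stays 5s,
	// so the retried attempt gets a fresh 5s window and passes the check.
	retry := taskRunStatus{StartTime: time.Now(), Timeout: taskTimeout}
	fmt.Println("retried attempt timed out:", retry.hasTimedOut(time.Now())) // false
}
```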
One more thing: the following output is the status after the third attempt, which also timed out. The first and second attempts ran in the same pod. Also, the second attempt ended with exit code 255 instead of a timeout; it did not run until the timeout.
completionTime: '2021-07-09T12:54:11Z'
conditions:
  - lastTransitionTime: '2021-07-09T12:54:11Z'
    message: TaskRun "please-say-bye-62xd2-hi-wwzpm" failed to finish within "10s"
    reason: TaskRunTimeout
    status: 'False'
    type: Succeeded
podName: please-say-bye-62xd2-hi-wwzpm-pod-2tc4n
retriesStatus:
  - completionTime: '2021-07-09T12:54:00Z'
    conditions:
      - lastTransitionTime: '2021-07-09T12:54:00Z'
        message: TaskRun "please-say-bye-62xd2-hi-wwzpm" failed to finish within "10s"
        reason: TaskRunTimeout
        status: 'False'
        type: Succeeded
    podName: please-say-bye-62xd2-hi-wwzpm-pod-8s5nz
    startTime: '2021-07-09T12:53:50Z'
    steps:
      - container: step-hi
        imageID: >-
          docker.io/library/alpine@sha256:87703314048c40236c6d674424159ee862e2b96ce1c37c62d877e21ed27a387e
        name: hi
        terminated:
          exitCode: 1
          finishedAt: '2021-07-09T12:54:00Z'
          reason: TaskRunTimeout
          startedAt: '2021-07-09T12:53:53Z'
    taskSpec:
      steps:
        - image: 'alpine:3.12'
          name: hi
          resources: {}
          script: |
            echo 'hi'
            #exit 1
            sleep 30
  - completionTime: '2021-07-09T12:54:01Z'
    conditions:
      - lastTransitionTime: '2021-07-09T12:54:01Z'
        message: >
          "step-hi" exited with code 255 (image:
          "docker.io/library/alpine@sha256:87703314048c40236c6d674424159ee862e2b96ce1c37c62d877e21ed27a387e");
          for logs run: kubectl -n default logs
          please-say-bye-62xd2-hi-wwzpm-pod-8s5nz -c step-hi
        reason: Failed
        status: 'False'
        type: Succeeded
    podName: please-say-bye-62xd2-hi-wwzpm-pod-8s5nz
    startTime: '2021-07-09T12:54:00Z'
    steps:
      - container: step-hi
        imageID: >-
          docker.io/library/alpine@sha256:87703314048c40236c6d674424159ee862e2b96ce1c37c62d877e21ed27a387e
        name: hi
        terminated:
          containerID: >-
            containerd://f86b959de7ee03bbf4e2b9df638d5d937981a863d4f124a74dcb3580535fcef9
          exitCode: 255
          finishedAt: '2021-07-09T12:54:00Z'
          reason: Error
          startedAt: '2021-07-09T12:53:54Z'
    taskSpec:
      steps:
        - image: 'alpine:3.12'
          name: hi
          resources: {}
          script: |
            echo 'hi'
            #exit 1
            sleep 30
startTime: '2021-07-09T12:54:01Z'
steps:
  - container: step-hi
    imageID: >-
      docker.io/library/alpine@sha256:87703314048c40236c6d674424159ee862e2b96ce1c37c62d877e21ed27a387e
    name: hi
    terminated:
      exitCode: 1
      finishedAt: '2021-07-09T12:54:11Z'
      reason: TaskRunTimeout
      startedAt: '2021-07-09T12:54:04Z'
taskSpec:
  steps:
    - image: 'alpine:3.12'
      name: hi
      resources: {}
      script: |
        echo 'hi'
        #exit 1
        sleep 30
Some debug log output:
2021-07-09T14:26:36.180Z debug tekton-pipelines-controller taskrun/taskrun.go:94 MARK_DEBUG pod_name=please-say-bye-s666f-hi-f5rg9-pod-7jgnk, timeout=&Duration{Duration:10s,} retry=0 has_started=true is_done=true is_cancelled=false has_timed_out=true {"commit": "0e9d9e6", "knative.dev/controller": "github.com.tektoncd.pipeline.pkg.reconciler.taskrun.Reconciler", "knative.dev/kind": "tekton.dev.TaskRun", "knative.dev/traceid": "2e4788be-323f-4e61-8b52-5f47433e13a2", "knative.dev/key": "default/please-say-bye-s666f-hi-f5rg9"}
2021-07-09T14:26:36.190Z debug tekton-pipelines-controller pipelinerun/pipelinerun.go:731 MARK_DEBUG pod_name=, timeout=&Duration{Duration:10s,} retry=1 new_timeout=&{10000000000} {"commit": "0e9d9e6", "knative.dev/controller": "github.com.tektoncd.pipeline.pkg.reconciler.pipelinerun.Reconciler", "knative.dev/kind": "tekton.dev.PipelineRun", "knative.dev/traceid": "6f6a60c3-b5dc-4020-97ef-8dfadce1eccb", "knative.dev/key": "default/please-say-bye-s666f"}
2021-07-09T14:26:36.221Z debug tekton-pipelines-controller taskrun/taskrun.go:94 MARK_DEBUG pod_name=, timeout=&Duration{Duration:10s,} retry=1 has_started=false is_done=false is_cancelled=false has_timed_out=false {"commit": "0e9d9e6", "knative.dev/controller": "github.com.tektoncd.pipeline.pkg.reconciler.taskrun.Reconciler", "knative.dev/kind": "tekton.dev.TaskRun", "knative.dev/traceid": "eea9dc61-2f72-4718-910a-d6a82fc1d6e5", "knative.dev/key": "default/please-say-bye-s666f-hi-f5rg9"}
2021-07-09T14:26:36.239Z debug tekton-pipelines-controller taskrun/taskrun.go:420 MARK_DEBUG DidTaskRunFail name=please-say-bye-s666f-hi-f5rg9-pod-7jgnk phase=Running {"commit": "0e9d9e6", "knative.dev/controller": "github.com.tektoncd.pipeline.pkg.reconciler.taskrun.Reconciler", "knative.dev/kind": "tekton.dev.TaskRun", "knative.dev/traceid": "eea9dc61-2f72-4718-910a-d6a82fc1d6e5", "knative.dev/key": "default/please-say-bye-s666f-hi-f5rg9"}
2021-07-09T14:26:36.345Z debug tekton-pipelines-controller taskrun/taskrun.go:94 MARK_DEBUG pod_name=please-say-bye-s666f-hi-f5rg9-pod-7jgnk, timeout=&Duration{Duration:10s,} retry=1 has_started=true is_done=false is_cancelled=false has_timed_out=false {"commit": "0e9d9e6", "knative.dev/controller": "github.com.tektoncd.pipeline.pkg.reconciler.taskrun.Reconciler", "knative.dev/kind": "tekton.dev.TaskRun", "knative.dev/traceid": "c820bcf9-5bae-4526-bffe-8f0d41289baf", "knative.dev/key": "default/please-say-bye-s666f-hi-f5rg9"}
This is my speculation (I'm not sure about the details):
- The TaskRun status is reset when it is retried
- A TaskRun reconcile occurs
- It passes through `HasTimedOut` because there is no status
- The controller gets the pod before its status has changed (due to a delay in the timeout) and reuses it (what is that delay???)
- However, the pod will soon exit with 255
https://github.com/tektoncd/pipeline/blob/0e9d9e603347f85db7017180412ed25419307aa6/pkg/reconciler/taskrun/taskrun.go#L417
What you describe is basically what happens.
Also, you are right that the `Timeout` is not recalculated. But the time that is allocated to your taskrun is, and this allocated-time calculation takes several parameters into account, including the pipelinerun and taskrun `Timeout` and `StartTime`.
https://github.com/tektoncd/pipeline/blob/0e9d9e603347f85db7017180412ed25419307aa6/pkg/reconciler/pipelinerun/pipelinerun.go#L969
https://github.com/tektoncd/pipeline/blob/0e9d9e603347f85db7017180412ed25419307aa6/pkg/reconciler/pipelinerun/pipelinerun.go#L1018
It takes the declared taskrun `Timeout`, adds it to the `StartTime`, and checks whether we are behind schedule. In case we are behind schedule, it still allocates 1s to run.
Does it make more sense?
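For reference, a minimal sketch of the allocation logic described above (simplified and illustrative only, not the actual Tekton source; it ignores the separate pipeline-level timeout and the per-task timeout and only shows the 1s fallback):

```go
package main

import (
	"fmt"
	"time"
)

// allocateTaskRunTimeout is an illustrative sketch of the behaviour described
// above: the tasks budget is measured from the pipelinerun start time, and if
// that budget is already spent, the taskrun is still handed a 1s timeout
// instead of not being created at all.
func allocateTaskRunTimeout(pipelineStart, now time.Time, tasksTimeout time.Duration) time.Duration {
	remaining := tasksTimeout - now.Sub(pipelineStart)
	if remaining <= 0 {
		// Behind schedule: still allocate 1s to the (re)tried or finally taskrun.
		return 1 * time.Second
	}
	return remaining
}

func main() {
	pipelineStart := time.Now().Add(-6 * time.Second) // pipelinerun started 6s ago
	tasksTimeout := 5 * time.Second                   // timeouts.tasks: 5s

	// The first attempt already consumed the 5s budget, so a taskrun created
	// now only gets the 1s fallback.
	fmt.Println(allocateTaskRunTimeout(pipelineStart, time.Now(), tasksTimeout)) // 1s
}
```

This is consistent with the PipelineRun status shown earlier, where a taskrun created after the budget was spent fails with `failed to finish within "1s"`.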
Yes, I saw what was happening. The point is that it is not the expected behavior; I wanted to check whether it was intended.
The problem for me is that Tekton still doesn't guarantee the execution of `finally` at all. This is a really serious problem for operating in production: the Tekton installations we operate have caused a lot of resource leaks and a lot of damage by not running `finally`. It was the biggest problem with the existing `timeout`, and I expected the implementation of the new `timeouts` to be reliable. But it still doesn't work.
I think it's counterintuitive that the task exceeds the time it was given and that `finally` then runs for only one second.
Could you rethink prioritizing the execution of `finally` tasks over tasks (and their retries)? I think it's worth it.
If you look at the related TEP, TEP-46, you will see what you are talking about in the alternatives section. There has been a lot of discussion around whether the execution of `finally` tasks must be guaranteed. If I remember well, I think we agreed that the new `timeouts` would give everyone enough flexibility to ensure that their tasks run as wanted. Would you mind sharing the production use cases where that is not the case? Also, @pritidesai maybe you want to add something here.
The real problem for us is that `finally` execution is not guaranteed.
But I'm not saying that the timeout should guarantee the execution of `finally`. I would like to know whether there is a proper, intentional reason why `finally` is not run.
TEP-46 says:
The users have an ability to specify the timeout for the tasks section.
In reality, it runs `tasks` beyond `timeouts.tasks` when retries occur. Does this behavior really match the TEP description and the user's intuition?
I'm sorry I haven't been able to follow all the discussions. I checked the TEPs in advance as much as possible, but I cannot find why `tasks` execution is allowed to exceed `timeouts.tasks`. If there is a good reason for it to be allowed, it should be documented in the TEP.
For example, our use case corresponds to the dogfooding scenario described in the TEP:
Cleanup cluster resources after finishing (with success/failure) integration tests (Dogfooding Scenario)
A simplification of the pipeline that actually caused the problem would look like this:
apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  generateName: our-use-case-example-
spec:
  timeouts:
    # We are explicitly setting timeouts
    pipeline: 5h
    tasks: 4h
    finally: 1h
  pipelineSpec:
    tasks:
      - name: create-resources
        retries: 2
        taskSpec:
          steps:
            - name: create-resources
              script: |
                # Allocate a lot of cluster resources.
                # For example:
                # - run AWS EMR clusters for processing big data
                # - run many pods for load testing before deployment
      - name: heavy-workload
        retries: 2
        taskSpec:
          steps:
            - name: workload
              script: |
                # This may be retried due to an external system...
    finally:
      - name: cleanup-all-resources
        taskSpec:
          steps:
            - name: cleanup
              script: |
                # Clean up all resources. We expect this to run.
                # We give the tasks 4 hours and the pipeline 5 hours, so there is an hour of grace period.
                # However, if a retry occurs, this is now almost always given only 1 second.
                # Because the task timeout is not recalculated, the overall task execution time can exceed 4 hours.
                # In this case, the tasks will run for up to 5 hours. (Why? We specified timeouts.tasks as 4 hours.)
                # It does not behave intuitively with respect to the given definition.
                # As a result, some resources will not be released.
I've shown why the task runs for more than 4 hours and explained the actual problems caused by `finally` not running.
Of course, we do account for the possibility that `finally` will not be executed. In reality, we perform external monitoring that releases resources, but those resources and that monitoring are generally very expensive. It is costly enough that we expect the timeout to be honored at the specified value regardless of retries.
Issues go stale after 90d of inactivity.
Mark the issue as fresh with `/remove-lifecycle stale` with a justification.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with `/close` with a justification.
If this issue should be exempted, mark the issue as frozen with `/lifecycle frozen` with a justification.
/lifecycle stale
Send feedback to tektoncd/plumbing.
/assign lbernick /assign jerop
/remove-lifecycle stale
@lbernick and I are looking into it
Just want to update with a summary of the work already done and that needs to be done:
- We've updated `timeouts.tasks` and `timeouts.finally` to account for time already spent during the pipelinerun.
- We need to update the pipelinerun controller to set a new timeout in the taskrun spec on every attempt. (We currently just update the status, including the start time.) Leaving the TaskRun's start time alone when creating a new retry will not work, as the timeout must be recalculated each time by the pipelinerun controller based on how much time has passed during execution of the pipelinerun's tasks, and the TaskRun's timeout reflects the minimum of that number and the timeout set for the pipeline task itself (see the sketch after this list).
- We need to update the function that sets timeouts for TaskRuns contained in pipelines to correctly handle the situation where a pipelinerun has already timed out, but a new task is being created anyway. Currently, it will create a new taskrun with a timeout of 1s, but no new taskruns should be created, and an error should be returned. If this is not done, we will keep creating retries with a 1s timeout even after pipeline-level timeouts have passed.
- Once this is complete, we need to correctly account for time spent during retries for timeouts.finally (addressed in #4508).
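For illustration, here is a minimal sketch of the recalculation described in the second and third bullets. All names are hypothetical and this is not the actual controller code; it assumes the remaining `timeouts.tasks` budget is capped by the per-pipeline-task timeout, and that an error replaces the current 1s fallback:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// ErrPipelineTasksTimedOut signals that no new (re)tried TaskRun should be
// created because the pipeline-level tasks budget is already spent.
// The name is hypothetical, for illustration only.
var ErrPipelineTasksTimedOut = errors.New("timeouts.tasks has already elapsed; not creating a new TaskRun")

// taskRunTimeoutForRetry recomputes the timeout to put on the TaskRun spec for
// each new attempt: the remaining pipeline-level tasks budget, capped by the
// per-pipeline-task timeout, with an error instead of a 1s fallback when the
// budget is already gone.
func taskRunTimeoutForRetry(pipelineStart, now time.Time, tasksTimeout, pipelineTaskTimeout time.Duration) (time.Duration, error) {
	remaining := tasksTimeout - now.Sub(pipelineStart)
	if remaining <= 0 {
		return 0, ErrPipelineTasksTimedOut
	}
	if pipelineTaskTimeout > 0 && pipelineTaskTimeout < remaining {
		return pipelineTaskTimeout, nil
	}
	return remaining, nil
}

func main() {
	start := time.Date(2021, 7, 9, 12, 53, 50, 0, time.UTC)
	now := start.Add(3 * time.Second)

	// 2s of a 5s tasks budget left, per-pipeline-task timeout 10s -> the retry gets 2s.
	fmt.Println(taskRunTimeoutForRetry(start, now, 5*time.Second, 10*time.Second))

	// Budget already spent -> an error instead of a fresh 1s TaskRun.
	fmt.Println(taskRunTimeoutForRetry(start, now, 2*time.Second, 10*time.Second))
}
```

In this sketch the error path is what would prevent the controller from creating fresh 1s retries after the pipeline-level timeout has passed.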
Unfortunately Jerop and I don't have the bandwidth to prioritize this issue right now.
/unassign /unassign @jerop /help /lifecycle frozen
@lbernick: This request has been marked as needing help from a contributor.
Please ensure the request meets the requirements listed here.
If this request no longer meets these requirements, the label can be removed by commenting with the `/remove-help` command.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/assign @vsinghai
I'm returning to this issue to see if it has been resolved by #5134 (FYI @abayer).
@ornew, I'm curious why you say in your original comment that the finally task should not be timed out if `timeouts.pipeline` is exceeded. I think not running the finally task is intended behavior, as `timeouts.pipeline` should refer to the entire time the pipeline is running. I tried using your example posted in the original comment, and it results in the pipelinerun being timed out after 10 seconds and finally tasks not being run, which I believe is the correct behavior, as the pipelinerun should stop running after 10s.
If you'd like to allow the finally tasks to run indefinitely but have the tasks section time out after some time, I think you need to specify `timeouts.tasks` = 5s and `timeouts.pipeline` = 0 (no timeout). (Unfortunately this doesn't work -- I filed https://github.com/tektoncd/pipeline/issues/5459, but it should be easily fixable.)
There's still a bug, though, with the retried taskrun not being timed out when `timeouts.tasks` is reached. Here's a reproducer:
apiVersion: tekton.dev/v1beta1
kind: Pipeline
metadata:
  name: please-say-bye-again
spec:
  tasks:
    - name: hi
      retries: 2
      taskSpec:
        steps:
          - name: hi
            image: alpine:3.12
            script: |
              echo 'hi'
              sleep 10
              exit 1
  finally:
    - name: bye
      taskSpec:
        steps:
          - name: bye
            image: alpine:3.12
            script: |
              echo 'bye'
---
apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  generateName: please-say-bye-again-
spec:
  timeouts:
    pipeline: 1m
    tasks: 19s
  pipelineRef:
    name: please-say-bye-again
In this example, each attempt (there are 3) sleeps for 10s and fails, the finally task is run, and the pipelinerun fails. I would expect that the first attempt fails and retries, the taskrun is canceled before the second attempt completes, the finally task runs, and the pipelinerun fails.
@lbernick
I'm curious why you say in your original comment that the finally task should not be timed out if timeouts.pipeline is exceeded.
I agree with you that the finally task should be timed out if `timeouts.pipeline` is exceeded.
This issue originally reported that `timeouts.tasks` does not work as the user expects: tasks keep being retried after `timeouts.tasks` has been exceeded, until `timeouts.pipeline` is exceeded.
When I said "finally should not time out" here, I meant the user's expected behavior in the context of `timeouts.pipeline` being consumed as a result of `timeouts.tasks` not being enforced correctly.
got it, that makes sense, thanks!
I think this should have been fixed in #5807: https://github.com/tektoncd/pipeline/pull/5807#issuecomment-1331103996