
Task time exceeds `timeouts.tasks` when the task is retried

Open ornew opened this issue 3 years ago • 16 comments

Expected Behavior

If the timeouts.tasks limit is exceeded, the task is not retried.

finally should always be executed.

Actual Behavior

If the task is retried, the total task time exceeds timeouts.tasks.

In addition, if this causes the pipeline execution time to exceed timeouts.pipeline, finally is forcibly timed out.

Steps to Reproduce the Problem

apiVersion: tekton.dev/v1beta1
kind: Pipeline
metadata:
  name: please-say-bye
spec:
  tasks:
    - name: hi
      retries: 2
      taskSpec:
        steps:
          - name: hi
            image: alpine:3.12
            script: |
              echo 'hi'
              sleep 10
  finally:
    - name: bye
      taskSpec:
        steps:
          - name: bye
            image: alpine:3.12
            script: |
              echo 'bye'
---
apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  generateName: please-say-bye-
spec:
  timeouts:
    pipeline: 10s
    tasks: 5s
    finally: 5s
  pipelineRef:
    name: please-say-bye
❯ tkn -n pipelines tr ls --label tekton.dev/pipeline=please-say-bye

NAME                             STARTED          DURATION    STATUS
please-say-bye-tzjdt-bye-pft4x   25 seconds ago   1 second    Failed(TaskRunTimeout)
please-say-bye-tzjdt-hi-qqlgs    30 seconds ago   5 seconds   Failed(TaskRunTimeout)

❯ tkn -n pipelines pr desc please-say-bye-tzjdt
Name:              please-say-bye-tzjdt
Namespace:         pipelines
Pipeline Ref:      please-say-bye
Service Account:   default
Labels:
 tekton.dev/pipeline=please-say-bye

🌑️  Status

STARTED        DURATION     STATUS
1 minute ago   14 seconds   Failed

πŸ’Œ Message

Tasks Completed: 2 (Failed: 2, Cancelled 0), Skipped: 0 (TaskRun "please-say-bye-tzjdt-bye-pft4x" failed to finish within "1s")

πŸ“¦ Resources

 No resources

βš“ Params

 No params

πŸ“ Results

 No results

πŸ“‚ Workspaces

 No workspaces

πŸ—‚  Taskruns

 NAME                               TASK NAME   STARTED        DURATION    STATUS
 ∙ please-say-bye-tzjdt-bye-pft4x   bye         1 minute ago   1 second    Failed(TaskRunTimeout)
 ∙ please-say-bye-tzjdt-hi-qqlgs    hi          1 minute ago   5 seconds   Failed(TaskRunTimeout)

⏭️  Skipped Tasks

 No Skipped Tasks
Screenshot 2021-07-02 17 59 45

Additional Info

  • Kubernetes version:
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.1", GitCommit:"5e58841cce77d4bc13713ad2b91fa0d961e69192", GitTreeState:"clean", BuildDate:"2021-05-12T14:18:45Z", GoVersion:"go1.16.4", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"20+", GitVersion:"v1.20.4-eks-6b7464", GitCommit:"6b746440c04cb81db4426842b4ae65c3f7035e53", GitTreeState:"clean", BuildDate:"2021-03-19T19:33:03Z", GoVersion:"go1.15.8", Compiler:"gc", Platform:"linux/amd64"}
  • Tekton Pipeline version:
Client version: 0.19.0
Pipeline version: v0.25.0
Triggers version: v0.14.0
Dashboard version: v0.17.0

ornew avatar Jul 02 '21 09:07 ornew

@ornew thanks for the issue. cc @souleb @pritidesai @sbwsg

vdemeester avatar Jul 02 '21 09:07 vdemeester

Hi @ornew thanks for the issue. The task timeout is processed at runtime from the pipelinerun start time. If you retry only a task, then taking the pipelinerun start time into consideration, your task has already timed out. Today the coded logic is to give the task 1s to run. I am actually not sure why we give that second...

souleb avatar Jul 02 '21 11:07 souleb

@souleb Thanks for checking this issue :)

When a TaskRun retries, the Timeout is not recalculated but the StartTime is reset. It looks to me like the retried task can pass through HasTimedOut.

https://github.com/tektoncd/pipeline/blob/0e9d9e603347f85db7017180412ed25419307aa6/pkg/reconciler/pipelinerun/pipelinerun.go#L730
https://github.com/tektoncd/pipeline/blob/0e9d9e603347f85db7017180412ed25419307aa6/pkg/reconciler/pipelinerun/pipelinerun.go#L749
https://github.com/tektoncd/pipeline/blob/0e9d9e603347f85db7017180412ed25419307aa6/pkg/reconciler/pipelinerun/pipelinerun.go#L896
https://github.com/tektoncd/pipeline/blob/0e9d9e603347f85db7017180412ed25419307aa6/pkg/apis/pipeline/v1alpha1/taskrun_types.go#L231
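For illustration only, a timeout check of roughly this shape would reproduce what is described (a minimal sketch with hypothetical names and signatures, not the actual Tekton source linked above): once a retry clears the status and StartTime, the elapsed time is measured from scratch and the check no longer fires, even though the pipeline-level budget is already spent.

package main

import (
	"fmt"
	"time"
)

// hasTimedOut is a simplified stand-in for the TaskRun timeout check
// discussed above (hypothetical signature, not the real Tekton API).
func hasTimedOut(startTime *time.Time, timeout time.Duration, now time.Time) bool {
	if startTime == nil || startTime.IsZero() {
		// A freshly retried TaskRun has its status (and StartTime) cleared,
		// so it passes straight through this check.
		return false
	}
	if timeout == 0 {
		// A zero timeout means "no timeout".
		return false
	}
	return now.Sub(*startTime) >= timeout
}

func main() {
	now := time.Now()
	oldStart := now.Add(-30 * time.Second)

	// First attempt: started 30s ago with a 10s timeout -> timed out.
	fmt.Println(hasTimedOut(&oldStart, 10*time.Second, now)) // true

	// Retry: the status (and StartTime) was reset, so the same 10s timeout
	// passes the check again even though the original budget is long gone.
	fmt.Println(hasTimedOut(nil, 10*time.Second, now)) // false
}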

One more thing: the following output is the status of the third attempt, which ended with a timeout. The first and second attempts ran in the same pod. Also, the second attempt ended with exit code 255 instead of a timeout; it did not run until the timeout was reached.

completionTime: '2021-07-09T12:54:11Z'
conditions:
  - lastTransitionTime: '2021-07-09T12:54:11Z'
    message: TaskRun "please-say-bye-62xd2-hi-wwzpm" failed to finish within "10s"
    reason: TaskRunTimeout
    status: 'False'
    type: Succeeded
podName: please-say-bye-62xd2-hi-wwzpm-pod-2tc4n
retriesStatus:
  - completionTime: '2021-07-09T12:54:00Z'
    conditions:
      - lastTransitionTime: '2021-07-09T12:54:00Z'
        message: TaskRun "please-say-bye-62xd2-hi-wwzpm" failed to finish within "10s"
        reason: TaskRunTimeout
        status: 'False'
        type: Succeeded
    podName: please-say-bye-62xd2-hi-wwzpm-pod-8s5nz
    startTime: '2021-07-09T12:53:50Z'
    steps:
      - container: step-hi
        imageID: >-
          docker.io/library/alpine@sha256:87703314048c40236c6d674424159ee862e2b96ce1c37c62d877e21ed27a387e
        name: hi
        terminated:
          exitCode: 1
          finishedAt: '2021-07-09T12:54:00Z'
          reason: TaskRunTimeout
          startedAt: '2021-07-09T12:53:53Z'
    taskSpec:
      steps:
        - image: 'alpine:3.12'
          name: hi
          resources: {}
          script: |
            echo 'hi'
            #exit 1
            sleep 30
  - completionTime: '2021-07-09T12:54:01Z'
    conditions:
      - lastTransitionTime: '2021-07-09T12:54:01Z'
        message: >
          "step-hi" exited with code 255 (image:
          "docker.io/library/alpine@sha256:87703314048c40236c6d674424159ee862e2b96ce1c37c62d877e21ed27a387e");
          for logs run: kubectl -n default logs
          please-say-bye-62xd2-hi-wwzpm-pod-8s5nz -c step-hi
        reason: Failed
        status: 'False'
        type: Succeeded
    podName: please-say-bye-62xd2-hi-wwzpm-pod-8s5nz
    startTime: '2021-07-09T12:54:00Z'
    steps:
      - container: step-hi
        imageID: >-
          docker.io/library/alpine@sha256:87703314048c40236c6d674424159ee862e2b96ce1c37c62d877e21ed27a387e
        name: hi
        terminated:
          containerID: >-
            containerd://f86b959de7ee03bbf4e2b9df638d5d937981a863d4f124a74dcb3580535fcef9
          exitCode: 255
          finishedAt: '2021-07-09T12:54:00Z'
          reason: Error
          startedAt: '2021-07-09T12:53:54Z'
    taskSpec:
      steps:
        - image: 'alpine:3.12'
          name: hi
          resources: {}
          script: |
            echo 'hi'
            #exit 1
            sleep 30
startTime: '2021-07-09T12:54:01Z'
steps:
  - container: step-hi
    imageID: >-
      docker.io/library/alpine@sha256:87703314048c40236c6d674424159ee862e2b96ce1c37c62d877e21ed27a387e
    name: hi
    terminated:
      exitCode: 1
      finishedAt: '2021-07-09T12:54:11Z'
      reason: TaskRunTimeout
      startedAt: '2021-07-09T12:54:04Z'
taskSpec:
  steps:
    - image: 'alpine:3.12'
      name: hi
      resources: {}
      script: |
        echo 'hi'
        #exit 1
        sleep 30

Some debug output:

2021-07-09T14:26:36.180Z        debug   tekton-pipelines-controller     taskrun/taskrun.go:94   MARK_DEBUG pod_name=please-say-bye-s666f-hi-f5rg9-pod-7jgnk, timeout=&Duration{Duration:10s,} retry=0 has_started=true is_done=true is_cancelled=false has_timed_out=true       {"commit": "0e9d9e6", "knative.dev/controller": "github.com.tektoncd.pipeline.pkg.reconciler.taskrun.Reconciler", "knative.dev/kind": "tekton.dev.TaskRun", "knative.dev/traceid": "2e4788be-323f-4e61-8b52-5f47433e13a2", "knative.dev/key": "default/please-say-bye-s666f-hi-f5rg9"}
2021-07-09T14:26:36.190Z        debug   tekton-pipelines-controller     pipelinerun/pipelinerun.go:731  MARK_DEBUG pod_name=, timeout=&Duration{Duration:10s,} retry=1 new_timeout=&{10000000000}       {"commit": "0e9d9e6", "knative.dev/controller": "github.com.tektoncd.pipeline.pkg.reconciler.pipelinerun.Reconciler", "knative.dev/kind": "tekton.dev.PipelineRun", "knative.dev/traceid": "6f6a60c3-b5dc-4020-97ef-8dfadce1eccb", "knative.dev/key": "default/please-say-bye-s666f"}
2021-07-09T14:26:36.221Z        debug   tekton-pipelines-controller     taskrun/taskrun.go:94   MARK_DEBUG pod_name=, timeout=&Duration{Duration:10s,} retry=1 has_started=false is_done=false is_cancelled=false has_timed_out=false   {"commit": "0e9d9e6", "knative.dev/controller": "github.com.tektoncd.pipeline.pkg.reconciler.taskrun.Reconciler", "knative.dev/kind": "tekton.dev.TaskRun", "knative.dev/traceid": "eea9dc61-2f72-4718-910a-d6a82fc1d6e5", "knative.dev/key": "default/please-say-bye-s666f-hi-f5rg9"}
2021-07-09T14:26:36.239Z        debug   tekton-pipelines-controller     taskrun/taskrun.go:420  MARK_DEBUG DidTaskRunFail name=please-say-bye-s666f-hi-f5rg9-pod-7jgnk phase=Running    {"commit": "0e9d9e6", "knative.dev/controller": "github.com.tektoncd.pipeline.pkg.reconciler.taskrun.Reconciler", "knative.dev/kind": "tekton.dev.TaskRun", "knative.dev/traceid": "eea9dc61-2f72-4718-910a-d6a82fc1d6e5", "knative.dev/key": "default/please-say-bye-s666f-hi-f5rg9"}
2021-07-09T14:26:36.345Z        debug   tekton-pipelines-controller     taskrun/taskrun.go:94   MARK_DEBUG pod_name=please-say-bye-s666f-hi-f5rg9-pod-7jgnk, timeout=&Duration{Duration:10s,} retry=1 has_started=true is_done=false is_cancelled=false has_timed_out=false     {"commit": "0e9d9e6", "knative.dev/controller": "github.com.tektoncd.pipeline.pkg.reconciler.taskrun.Reconciler", "knative.dev/kind": "tekton.dev.TaskRun", "knative.dev/traceid": "c820bcf9-5bae-4526-bffe-8f0d41289baf", "knative.dev/key": "default/please-say-bye-s666f-hi-f5rg9"}

This is my speculation (I'm not sure about the details):

  • The TaskRun status is reset when it retries
  • A TaskRun reconcile occurs
  • It passes through HasTimedOut because there is no status
  • The controller gets the pod before its status reflects the timeout (because of some delay; what is the delay???) and reuses it
  • However, the pod soon exits with 255

https://github.com/tektoncd/pipeline/blob/0e9d9e603347f85db7017180412ed25419307aa6/pkg/reconciler/taskrun/taskrun.go#L417

ornew avatar Jul 09 '21 15:07 ornew

What you describe is basically what happens.

Also, you are right that the Timeout is not recalculated. But the time that is allocated to your taskrun is, and this allocated-time calculation takes several parameters into account, including the pipelinerun and taskrun Timeout and StartTime.

https://github.com/tektoncd/pipeline/blob/0e9d9e603347f85db7017180412ed25419307aa6/pkg/reconciler/pipelinerun/pipelinerun.go#L969

https://github.com/tektoncd/pipeline/blob/0e9d9e603347f85db7017180412ed25419307aa6/pkg/reconciler/pipelinerun/pipelinerun.go#L1018

It takes the declared taskrun Timeout, adds it to the StartTime, and checks whether we are behind schedule. If we are behind schedule, it still allocates 1s to run.
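As a minimal sketch of that calculation (hypothetical helper and parameter names, not the actual functions linked above): the remaining budget is derived from the PipelineRun's start time, clamped by the pipeline task's own timeout, and floored at 1s once the schedule is already blown, which is why a late retry gets only one second.

package main

import (
	"fmt"
	"time"
)

// taskRunTimeout sketches the allocation described above. tasksTimeout is
// the budget for the tasks section, taskTimeout the timeout declared on the
// pipeline task itself. Hypothetical helper, not the Tekton implementation.
func taskRunTimeout(prStart time.Time, tasksTimeout, taskTimeout time.Duration, now time.Time) time.Duration {
	remaining := tasksTimeout - now.Sub(prStart)
	allocated := taskTimeout
	if remaining < allocated {
		allocated = remaining
	}
	if allocated <= 0 {
		// Behind schedule: the controller still hands the TaskRun 1s to run,
		// so a retry is created instead of being refused.
		return 1 * time.Second
	}
	return allocated
}

func main() {
	prStart := time.Now().Add(-8 * time.Second) // PipelineRun started 8s ago
	// timeouts.tasks = 5s and task timeout = 5s: already over budget -> 1s floor.
	fmt.Println(taskRunTimeout(prStart, 5*time.Second, 5*time.Second, time.Now()))
}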

Does it make more sense?

souleb avatar Jul 16 '21 15:07 souleb

Yes, I saw what was happening. I raised this because it is not the expected behavior; I wanted to check whether it was intended.

The problem for me is that Tekton still doesn't guarantee the execution of finally at all. This is a really serious problem for operating in production: the Tekton installation we operate has caused a lot of resource leaks and a lot of damage by not running finally. It's the biggest problem with the existing timeout, and I expected the implementation of the new timeouts to be reliable. But it still doesn't work.

I think it's counterintuitive that the task exceeds the time it was given and that finally then runs for only one second. Could you rethink prioritizing finally task execution over tasks (including retried ones)? I think it's worth it.

ornew avatar Aug 07 '21 15:08 ornew

If you look at the related TEP, TEP-46, you will see what you are talking about in the Alternatives section. There has been a lot of discussion around whether the execution of finally tasks must be guaranteed. If I remember well, I think we agreed that the new timeouts would give everyone enough flexibility to ensure that their tasks run as wanted. Would you mind sharing the production use cases where that is not the case? Also, maybe @pritidesai you may want to add something here.

souleb avatar Aug 07 '21 17:08 souleb

The real problem for us is that finally execution is not guaranteed.

But I'm not saying that the timeout should guarantee the execution of finally. I would like to know whether there is a proper, intentional reason why finally is not running.

The TEP-46 said:

The users have an ability to specify the timeout for the tasks section.

Actually, tasks run beyond timeouts.tasks when retries occur. Does this behavior really match the TEP description and intuition?

I'm sorry I haven't been able to follow all the discussions. I checked the TEPs in advance as much as possible, but I cannot find why task execution is allowed to exceed timeouts.tasks. If there is a good reason for it to be allowed, it should be documented in the TEP.

For example, our use case corresponds to the dogfooding scenario shown in the TEP.

Cleanup cluster resources after finishing (with success/failure) integration tests (Dogfooding Scenario)

A simplification of the pipeline that actually caused the problem would look like this:

apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  generateName: our-use-case-example-
spec:
  timeouts:
    # We are explicitly setting timeouts
    pipeline: 5h
    tasks: 4h
    finally: 1h
  pipelineSpec:
    tasks:
      - name: create-resources
        retries: 2
        taskSpec:
          steps:
            - name: create-resources
              script: |
                # Allocate a lot of cluster resources.
                # For examples:
                # - run AWS EMR clusters for processing big data
                # - run many pod for load testing before deployment
      - name: heavy-workload
        retries: 2
        taskSpec:
          steps:
            - name: workload
              script: |
                # This may be retried due to an external system...
    finally:
      - name: cleanup-all-resources
        taskSpec:
          steps:
            - name: cleanup
              script: |
                # Cleanup all resources. We expect this to be run.
                # We give the task 4 hours and the pipeline 5 hours. There is an hour grace period.
                # However, if a retry occurs, this is now almost always given only 1 second.
                # By not recalculating the task timeout, the overall task execution time can exceed 4 hours.
                # In this case, the task will run for up to 5 hours. (why? We have specified timeouts.tasks as 4 hours.)
                # It does not behave intuitively with respect to the given definition.
                # As a result, some resources will not be released.

I've shown why the task runs for more than 4 hours and explained the actual problem caused by finally not running.

Of course, we account for the possibility that finally will not be executed. In reality, we run external monitoring that releases resources, but these resources and this monitoring are generally very expensive. It is costly enough that we expect the timeout to be honored at the specified value regardless of retries.

ornew avatar Aug 08 '21 14:08 ornew

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale with a justification. Stale issues rot after an additional 30d of inactivity and eventually close. If this issue is safe to close now please do so with /close with a justification. If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

tekton-robot avatar Nov 06 '21 14:11 tekton-robot

/assign lbernick /assign jerop

lbernick avatar Nov 08 '21 22:11 lbernick

/remove-lifecycle stale

@lbernick and I are looking into it

jerop avatar Nov 08 '21 22:11 jerop

Just want to update with a summary of the work already done and the work that still needs to be done:

  • we've updated timeouts.tasks and timeouts.finally to account for time already spent during the pipelinerun.
  • We need to update the pipelinerun controller to set a new timeout in the taskrun spec on every attempt. (We currently just update the status, including the start time.) Leaving the TaskRun's start time alone when creating a new retry will not work, as the timeout must be recalculated each time by the pipelinerun controller based on how much time has passed during execution of the pipelinerun's tasks, and the TaskRun's timeout reflects the minimum of that number and the timeout set for the pipeline task itself.
  • We need to update the function that sets timeouts for TaskRuns contained in pipelines to correctly handle the situation where a pipelinerun has already timed out, but a new task is being created anyway. Currently, it will create a new taskrun with a timeout of 1s, but no new taskruns should be created, and an error should be returned. If this is not done, we will keep creating retries with a 1s timeout even after pipeline-level timeouts have passed. (A rough sketch of this check follows the list.)
  • Once this is complete, we need to correctly account for time spent during retries for timeouts.finally (addressed in #4508).
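As a rough sketch of the check referenced in the third item above (hypothetical function and error names, not existing controller code), recomputing the budget on every attempt and returning an error once it is exhausted would stop retries from being created with a 1s timeout:

package main

import (
	"errors"
	"fmt"
	"time"
)

// errTasksTimedOut is a hypothetical sentinel the controller could return
// instead of creating a TaskRun with a 1s timeout.
var errTasksTimedOut = errors.New("tasks timeout exceeded; refusing to create a new TaskRun")

// retryTimeout sketches the proposed behavior: recompute the remaining
// budget for each attempt and refuse to schedule once it is exhausted.
func retryTimeout(prStart time.Time, tasksTimeout, taskTimeout time.Duration, now time.Time) (time.Duration, error) {
	remaining := tasksTimeout - now.Sub(prStart)
	if remaining <= 0 {
		return 0, errTasksTimedOut
	}
	if taskTimeout < remaining {
		return taskTimeout, nil
	}
	return remaining, nil
}

func main() {
	prStart := time.Now().Add(-20 * time.Second)
	if _, err := retryTimeout(prStart, 19*time.Second, 19*time.Second, time.Now()); err != nil {
		fmt.Println(err) // budget exhausted: no new attempt should be created
	}
}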

Unfortunately Jerop and I don't have the bandwidth to prioritize this issue right now.

/unassign /unassign @jerop /help /lifecycle frozen

lbernick avatar Apr 07 '22 18:04 lbernick

@lbernick: This request has been marked as needing help from a contributor.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed by commenting with the /remove-help command.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

tekton-robot avatar Apr 07 '22 18:04 tekton-robot

/assign @vsinghai

vsinghai avatar May 25 '22 17:05 vsinghai

I'm returning to this issue to see if it has been resolved by #5134 (FYI @abayer).

@ornew, I'm curious why you say in your original comment that the finally task should not be timed out if timeouts.pipeline is exceeded. I think not running the finally task is intended behavior, as timeouts.pipeline should refer to the entire time the pipeline is running. I tried using your example posted in the original comment, and it results in the pipelinerun being timed out after 10 seconds and finally tasks not being run, which I believe is the correct behavior, as the pipelinerun should stop running after 10s.

If you'd like to allow the finally tasks to run indefinitely but have the tasks section time out after some time, I think you need to specify timeouts.tasks = 5s and timeouts.pipeline = 0 (no timeout). (Unfortunately this doesn't work; I filed https://github.com/tektoncd/pipeline/issues/5459, but it should be easily fixable.)
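For illustration, a PipelineRun with that suggested configuration might look like the following (a sketch only; as noted above, this combination did not work at the time and is tracked in the linked issue):

apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  generateName: please-say-bye-
spec:
  timeouts:
    pipeline: 0s   # no overall timeout for the PipelineRun
    tasks: 5s      # the tasks section still times out after 5s
  pipelineRef:
    name: please-say-bye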

There's still a bug, though, with the retried taskrun not being timed out when timeouts.tasks is reached. Here's a reproducer:

apiVersion: tekton.dev/v1beta1
kind: Pipeline
metadata:
  name: please-say-bye-again
spec:
  tasks:
    - name: hi
      retries: 2
      taskSpec:
        steps:
          - name: hi
            image: alpine:3.12
            script: |
              echo 'hi'
              sleep 10
              exit 1
  finally:
    - name: bye
      taskSpec:
        steps:
          - name: bye
            image: alpine:3.12
            script: |
              echo 'bye'
---
apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  generateName: please-say-bye-again-
spec:
  timeouts:
    pipeline: 1m
    tasks: 19s
  pipelineRef:
    name: please-say-bye-again

In this example, each attempt (there are 3) sleeps for 10s and fails, the finally task is run, and the pipelinerun fails. I would expect that the first attempt fails and retries, the taskrun is canceled before the second attempt completes, the finally task runs, and the pipelinerun fails.

lbernick avatar Sep 08 '22 19:09 lbernick

@lbernick

I'm curious why you say in your original comment that the finally task should not be timed out if timeouts.pipeline is exceeded.

I agree with you that the finally task should be timed out if timeouts.pipeline is exceeded.

This issue originally reported that timeouts.tasks does not work as the user expects: tasks keep being retried even though timeouts.tasks has been exceeded, until timeouts.pipeline is exceeded.

When I said "finally should not time out" here, I meant the behavior the user expects in the situation where timeouts.pipeline is consumed as a result of timeouts.tasks being applied incorrectly.

ornew avatar Sep 14 '22 14:09 ornew

got it, that makes sense, thanks!

lbernick avatar Sep 14 '22 20:09 lbernick

I think this should have been fixed in #5807: https://github.com/tektoncd/pipeline/pull/5807#issuecomment-1331103996

lbernick avatar Jan 18 '23 16:01 lbernick