Design: Partial Pipeline execution

The work for this task is to design this feature and present one or more proposals (before implementing).

Expected Behavior

If a Pipeline has many Tasks and takes a long time to run (e.g. tens of minutes, or even hours), and one Task fails, it is desirable to be able to pick up execution at the Task that failed, with different PipelineParams (e.g. from a different git commit), so that you can resume the Pipeline without rerunning the whole thing.

Some ideas for how to implement this:

- Fields in a PipelineRun that specify which Task to start from and/or refer to a previous PipelineRun whose results should be reused
- A tool that makes it easy to create a new Pipeline from an existing one, containing only a subset of the Tasks
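
As a rough illustration of the second idea, here is a minimal sketch that derives a smaller task list from an existing one. It assumes tasks are plain dicts with `name` and `runAfter` keys (mirroring those Pipeline fields) and that ordering is expressed only through `runAfter`; the function name `subset_pipeline` is hypothetical, and dependencies expressed through results or resources are not handled.

```python
from collections import deque


def subset_pipeline(tasks, start):
    """Return `start` plus every task downstream of it via runAfter edges."""
    names = {t["name"] for t in tasks}
    if start not in names:
        raise ValueError(f"unknown task: {start}")

    # Forward adjacency map: task name -> names of tasks that run after it.
    downstream = {name: [] for name in names}
    for t in tasks:
        for dep in t.get("runAfter", []):
            downstream[dep].append(t["name"])

    # Breadth-first walk from the starting task.
    keep = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for nxt in downstream[current]:
            if nxt not in keep:
                keep.add(nxt)
                queue.append(nxt)

    # Preserve the original order and drop runAfter references to removed tasks.
    result = []
    for t in tasks:
        if t["name"] in keep:
            pruned = dict(t)
            pruned["runAfter"] = [d for d in t.get("runAfter", []) if d in keep]
            result.append(pruned)
    return result


tasks = [
    {"name": "build", "runAfter": []},
    {"name": "unit-test", "runAfter": ["build"]},
    {"name": "deploy", "runAfter": ["unit-test"]},
    {"name": "lint", "runAfter": []},
]
print([t["name"] for t in subset_pipeline(tasks, "unit-test")])
```

Starting from `unit-test`, the sketch keeps `unit-test` and `deploy` and drops `build` and `lint`. A real tool would also need to supply whatever the dropped Tasks produced, which is the hard part of this design.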

It is also worth considering what this could be like via a UI: if one is viewing a Pipeline in a UI, and wants to re-run only a portion of the Pipeline, they probably want the user experience to be as if they were still running the same Pipeline, even if underneath a new Pipeline is created.

Actual Behavior

At the moment, if any Task in a Pipeline fails, your options for rerunning the rest of the Pipeline are:

- Run the entire Pipeline again
- Create a new Pipeline from the previous one which contains only the Tasks you wish to run

Additional Info

This originally came up in discussion about #39, in the context of whether we'd want to always use the same git commit from a source for all Tasks in a Pipeline, or whether we'd sometimes want a Task to always use HEAD. The latter would allow a user to change a repo, by updating HEAD, between Task executions.

The feature of partial pipeline execution could be an alternative to this.

Sep 16 '18 21:09 bobcatfish

@BenTheElder, @cjwagner and some other Prow folks indicated that this would be a very desirable feature for them, particularly when a pipeline has an initial phase that builds a bunch of stuff and subsequent phases that use what was built. In that case it'd be handy to be able to resume after the point where the stuff is built.

Sep 19 '18 20:09 bobcatfish

Just to add (or rather try to help clarify) a use case here.

This is a very useful feature for long-running pipelines that probably fall outside the strict CI scope. Most pipelines I have in mind are essentially workflow automation pipelines and have external dependencies such as 3rd party systems that need to be up / reachable.

When such a pipeline fails at step 7/11, you really don't want to rerun the whole thing. The Jenkins "Restart from Stage" feature is ideal here, as it lets the pipeline essentially pick up where it left off.

The problem with most Jenkins pipelines is that they are not written in such a way that restarting from any particular stage would be possible, as inputs / outputs of each stage (task) are not always well defined.

Coming to Tekton and finding inputs/outputs so explicitly declared, I almost see an opportunity whereby, once this feature is implemented, it will work on "all" Tekton pipelines, significantly widening the scope of problems Tekton pipelines can be used to solve. (Plus, anyone who relies on this in Jenkins will find it easier to migrate to Tekton.)

As a final point, I would like to clarify that in my use case, support for restarting with "different PipelineParams" (as mentioned in the description) is not a necessary feature. I am sure people have use cases for that too, but I personally like the approach Jenkins takes here: you can either restart the whole pipeline with different params (new pipeline run), or restart from stage, when it failed, always with the same params (retry failed pipeline run, starting from failed task).
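
To make the two options concrete, here is a minimal sketch of that decision. It assumes a previous run is recorded as its params plus a per-task status string; `PreviousRun`, `plan_run` and the status values are hypothetical names for illustration, not an existing Tekton API.

```python
from dataclasses import dataclass, field


@dataclass
class PreviousRun:
    params: dict
    # Task name -> "Succeeded" or "Failed"; tasks that never started are absent.
    status: dict = field(default_factory=dict)


def plan_run(task_names, previous, params):
    """Return (reuse, rerun): tasks whose results are kept and tasks to execute."""
    # Different params means a brand new run, so nothing can be reused.
    if params != previous.params:
        return [], list(task_names)
    # Same params: keep what succeeded, run everything else.
    reuse = [n for n in task_names if previous.status.get(n) == "Succeeded"]
    rerun = [n for n in task_names if previous.status.get(n) != "Succeeded"]
    return reuse, rerun


names = ["fetch", "build", "publish", "notify"]
previous = PreviousRun(
    params={"revision": "abc123"},
    status={"fetch": "Succeeded", "build": "Succeeded", "publish": "Failed"},
)

print(plan_run(names, previous, {"revision": "abc123"}))
print(plan_run(names, previous, {"revision": "def456"}))
```

With the same params only `publish` and `notify` are rerun; with a different revision the whole pipeline runs again.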

Hope this helps.

Apr 15 '20 20:04 gsaslis

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten.

/close

Send feedback to tektoncd/plumbing.

Aug 14 '20 00:08 tekton-robot

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity. If this issue is safe to close now please do so with /close.

/lifecycle rotten

Send feedback to tektoncd/plumbing.

Aug 14 '20 00:08 tekton-robot

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. If this issue is safe to close now please do so with /close.

/lifecycle stale

Send feedback to tektoncd/plumbing.

Aug 14 '20 00:08 tekton-robot

@tekton-robot: Closing this issue.

Aug 14 '20 00:08 tekton-robot

/remove-lifecycle rotten /remove-lifecycle stale /reopen

Aug 17 '20 09:08 vdemeester

@vdemeester: Reopened this issue.

Aug 17 '20 09:08 tekton-robot

This one is on our roadmap: https://github.com/tektoncd/pipeline/blob/master/roadmap.md

/lifecycle frozen

Aug 17 '20 15:08 bobcatfish

> The problem with most Jenkins pipelines is that they are not written in such a way that restarting from any particular stage would be possible, as inputs / outputs of each stage (task) are not always well defined.
>
> Coming to Tekton and finding inputs/outputs so explicitly declared, I almost see an opportunity whereby, once this feature is implemented, it will work on "all" Tekton pipelines

I think where this falls apart is the lack of copy-on-write workspaces. Tasks can easily introduce changes to workspaces, which makes execution non-idempotent. If workspaces were instead always inputs xor outputs, the input to any given stage would still exist when the retry is attempted.

Sep 01 '20 18:09 coryrc

Input/output workspace layering cannot be done efficiently without copy-on-write workspaces. It could be done with an NFS server and overlay filesystems, or there appear to be some COW volumes, but I do not know enough about k8s to say whether they are usable for this.
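
A minimal sketch of the "inputs xor outputs" idea, using a full directory copy as a stand-in for real copy-on-write (which is exactly the inefficiency mentioned above). The helper `run_isolated` and the failing task are hypothetical and exist only for illustration.

```python
import shutil
import tempfile
from pathlib import Path


def run_isolated(input_dir, task):
    """Run `task` against a private copy of `input_dir`; never mutate the input."""
    scratch = Path(tempfile.mkdtemp(prefix="workspace-"))
    output_dir = scratch / "out"
    # A full copy stands in for a copy-on-write layer over the input.
    shutil.copytree(input_dir, output_dir)
    try:
        task(output_dir)
    except Exception:
        # Discard the partial output; the input is still exactly as it was.
        shutil.rmtree(scratch)
        raise
    return output_dir


def flaky_task(workspace):
    # Mutates the workspace, then fails part-way through.
    (workspace / "artifact.txt").write_text("half-written")
    raise RuntimeError("external system unreachable")


source = Path(tempfile.mkdtemp(prefix="input-"))
(source / "artifact.txt").write_text("built")

try:
    run_isolated(source, flaky_task)
except RuntimeError:
    pass

print((source / "artifact.txt").read_text())  # prints "built"
```

Because the failed attempt only ever touched its own copy, a retry starts from the same input the first attempt saw.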

Sep 01 '20 18:09 coryrc

> Tasks can easily introduce changes to workspaces which makes execution non-idempotent

I think there will be use cases where folks want to make these kinds of non-idempotent changes to workspaces, so even with COW I'm not sure we could fully solve this problem? If I'm wrong it would probably help if you could explain with an example.

Also: it sounds like COW workspaces would be an interesting feature in general - if you feel motivated it'd be great to have a separate issue to dive into this in detail

Sep 02 '20 21:09 bobcatfish

Quick update here: @jerop created a design for #1797 which has some interesting ideas that could be applied to a design for partial execution (design doc).

Jan 06 '21 16:01 bobcatfish

TEP-0123 (Specifying on-demand-retry in a pipelineTask) does not offer a solution for this feature. It instead proposes allowing on-demand-retry to be specified at authoring time.

Sep 20 '22 18:09 pritidesai

Can Tekton Pipelines provide the functionality to rerun failed tasks in a pipeline? This would be very useful for our scenario, where a complex pipeline fails on the last task and someone has to rerun the failed task manually. GitHub Actions has a similar feature.

Dec 22 '23 09:12 jwx0925
