pipeline Cancel TaskRuns using entrypoint binary

Today

When a TaskRun is cancelled, the TaskRun controller deletes the TaskRun's underlying Pod. This stops execution ~immediately, but also leads Kubernetes to reap the Pod's logs.

Feature Request

In #2559, we're discussing enforcing TaskRun-level timeouts in the entrypoint binary, so that timed-out TaskRun Pods aren't deleted and their logs aren't lost. Instead of the controller deleting the Pod, the entrypoint binary that runs each step would stop executing, fail any running step, and not run any subsequent steps.

If we end up doing that, we could also enforce cancellation in the entrypoint binary, which would let us keep Pods and logs around for cancelled TaskRuns too.

To accomplish this, the entrypoint binary could take a new flag, -cancel_file, pointing at a file in a Downward API volume populated from a Pod annotation. This is similar to how we signal the first step to start only after all sidecars are ready. In this model, when a TaskRun is cancelled, the TaskRun controller would annotate the Pod with, for example, "cancelled=true". That would update the contents of the projected file; the entrypoint binary would see the change and stop executing the currently running step.

This behavior change should be guarded by a feature flag (opt-in at first) since some users might depend on the current behavior. This also gives us an opportunity to compare behavior and timing of cancellation between the two implementations.

/kind feature

Sep 15 '20 14:09 imjasonh

Makes sense to me to revisit this!

@ImJasonH are there any other options that are worth evaluating for this? For example, we could send a signal that the entrypoint binary could catch - iirc the only reason we haven't relied more on signals is b/c we haven't been able to rely on how some arbitrary process will handle it; we can control the entrypoint binary tho 🤔

I think @sbwsg has looked into this as well in the context of the initial cancellation feature

Sep 15 '20 17:09 bobcatfish

Can we send signals to containers in the pod from the controller? I didn't think that was an option, so we've relied on file-existence checks backed by Downward API volumes instead.

We've also discussed having a sidecar in the Pod that can accept RPCs from the controller, but that's a much larger change, and ultimately sort of orthogonal to how entrypoint stops the user's entrypoint execution.

Sep 15 '20 18:09 imjasonh

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale with a justification. Stale issues rot after an additional 30d of inactivity and eventually close. If this issue is safe to close now please do so with /close with a justification. If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

Jan 03 '21 16:01 tekton-robot

/remove-lifecycle stale /lifecycle frozen

Putting this in the "frozen" box as this is something that is worth exploring 🙃

Jan 05 '21 15:01 vdemeester

Is this still a work in progress? I would love to implement it.

Jan 29 '23 14:01 chengjoey

@chengjoey I didn't have time to keep the PR up to date, so yes, please go ahead 🙏🏼

Feb 03 '23 09:02 vdemeester

/assign

Apr 07 '23 14:04 chengjoey