TEP-0121: Refine Retries for TaskRuns and CustomRuns
Retries is designed inconsistently between TaskRun and CustomRun; see the Motivation section for details. In this TEP, we propose to resolve the inconsistency by:
- solidifying that Timeout applies to each retry attempt, and updating TEP-0069 and docs/runs.md to reflect the alignment;
- migrating the Retries implementation to the TaskRun reconciler.
/kind tep
/assign
/assign @pritidesai
Pushed a new commit comparing related work.
I'm moving this TEP to implementable as we've fully discussed the options in the Tekton API Working Group, PTAL~
- We need to document clearly the end condition of a Custom Run.
  - For now, we rely on both the Condition and retriesStatus: IsDone() && len(retriesStatus) == retries.
    - This is not aligned with what we documented, because the PipelineRun controller thinks the run should be retried when it times out.
  - An alternative: make the decision solely based on the Condition.
    - Consider removing retriesStatus from the run status, because we'd like to give customers as much flexibility as possible.
- Do we want to change the documentation of v1beta1 only, or of both v1alpha1 and v1beta1?
A piece of background information: the current implementation differs from what is documented (the timeout applies to each retry attempt rather than to all retry attempts together), but it is aligned with TaskRun behavior.
The current end condition when a Run fails:
// isFailure returns true only if the run has failed and will not be retried.
// If the PipelineTask has a Matrix, isFailure returns true if any run has failed (no remaining retries)
// and all other runs are done.
func (t ResolvedPipelineTask) isFailure() bool {
//...
return isDone && c.IsFalse() && !t.hasRemainingRetries()
}
// IsDone returns true if the Run's status indicates that it is done.
func (r *Run) IsDone() bool {
return !r.Status.GetCondition(apis.ConditionSucceeded).IsUnknown()
}
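For reference, here is a minimal sketch of the retries check implied by !t.hasRemainingRetries() and the IsDone() && len(retriesStatus) == retries condition above. It is written against simplified stand-in fields rather than the real Tekton API types; the names below are illustrative only.

package retrysketch

// runRetryState is an illustrative stand-in for the fields the PipelineRun
// reconciler consults today; the real values come from the Run spec and status.
type runRetryState struct {
	succeededIsKnown bool // Succeeded condition is no longer Unknown (IsDone)
	succeededIsFalse bool // Succeeded condition is explicitly False
	retries          int  // spec.retries
	retriesStatusLen int  // len(status.retriesStatus)
}

// isFailure mirrors the end condition quoted above: the run is done, it has
// failed, and retriesStatus shows that no retries remain.
func (r runRetryState) isFailure() bool {
	return r.succeededIsKnown && r.succeededIsFalse && r.retriesStatusLen >= r.retries
}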
We propose to not rely on retriesStatus as part of the end condition; instead, we instruct CustomRun implementors NOT to update the run condition ConditionSucceeded until all retries are exhausted.
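Under that contract, a hedged sketch (not merged code) of how the PipelineRun-side check could look once retriesStatus is no longer consulted; the function name is hypothetical, but the condition helpers are the same ones the current code uses:

package retrysketch

import "knative.dev/pkg/apis"

// isFinalFailure is an illustrative sketch of the proposed check: because a
// conforming CustomRun controller keeps Succeeded at Unknown while retries are
// still pending, a False condition already means "failed, no retries left".
func isFinalFailure(succeeded *apis.Condition) bool {
	// Unknown (or missing) means the run, including pending retries, is still
	// in progress; True means it succeeded.
	return succeeded.IsFalse()
}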
@lbernick, @pritidesai, @abayer wdyt?
cc @dibyom
thanks Xinru for this summary! I think we should remove status.retriesStatus from v1beta1.CustomRun instead of only changing the docs and PR reconciler code.
Thanks @lbernick! Right, updated the summary to reflect the proposal.
Thanks, @XinruZhang for working on this. I definitely agree we need some rework in this area. My preferred option would actually be the opposite (see my comment in https://github.com/tektoncd/community/pull/816/files#r985763039) - but of course, I'd be happy to be convinced otherwise.
Hi everyone, I'm moving this to proposed; hope we can check it in and iterate on it later. LMK if you have other concerns.
cc @afrittoli @pritidesai @lbernick @jerop @dibyom
thanks Xinru! Since we haven't come to a consensus yet on the way forward, maybe it would be better to just clearly list out the alternatives instead of specifying one as the proposal? (even if it's in a proposed state)
Thanks @pritidesai @jerop @lbernick for your detailed review, appreciate it a lot! ❤️ I pushed a new commit to reflect what we've discussed in the comment thread, PTAL :)
/hold
Talking with @pritidesai offline about some important details of Option 1; related links:
- https://github.com/tektoncd/pipeline/blob/main/docs/variables.md
- https://github.com/tektoncd/pipeline/pull/3770
Will update the TEP soon.
Hello folks, I am just trying to catch up with all the updates and might have missed some. Do we ensure that a custom task treats the timeout as applying to each retry, and if so, how?
Hi @ScrapCodes, thanks for the comment! Yes, that's a great question. I don't think we have a way to enforce that; this is more of a contract we build with our customers: if you don't implement the custom task controller in this way, the pipeline controller may not work as intended.
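To make the contract concrete, here is a small hedged sketch of the decision a conforming custom-task controller could make on each reconcile. The names (decideSucceeded, retryOutcome, and friends) are hypothetical and not part of the Tekton API; the assumption is that a timed-out attempt simply counts as a failed attempt, and Succeeded stays Unknown until an attempt passes or the initial attempt plus all retries have failed.

package retrysketch

// retryOutcome is an illustrative value for the Succeeded condition status;
// it is not a Tekton API type.
type retryOutcome string

const (
	keepUnknown  retryOutcome = "Unknown" // attempts remain, keep retrying
	finalFailure retryOutcome = "False"   // all attempts (1 + retries) failed
	finalSuccess retryOutcome = "True"    // an attempt succeeded
)

// decideSucceeded sketches the contract under the per-attempt timeout
// assumption: a timed-out attempt counts as a failed attempt, and Succeeded
// only leaves Unknown once an attempt passes or every attempt has failed.
func decideSucceeded(failedAttempts, retries int, lastAttemptPassed bool) retryOutcome {
	switch {
	case lastAttemptPassed:
		return finalSuccess
	case failedAttempts > retries: // initial attempt plus all retries failed
		return finalFailure
	default:
		return keepUnknown
	}
}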
We (Prashant and I) reached an agreement in an offline discussion last Friday; see the meeting notes. Thanks for the great discussion! @ScrapCodes
thanks @XinruZhang for all your efforts and patience 🤗
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: jerop, lbernick, pritidesai
The full list of commands accepted by this bot can be found here.
The pull request process is described here.
- ~~teps/OWNERS~~ [jerop,lbernick,pritidesai]
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
LGTM since we have all the approvals now
/lgtm