community icon indicating copy to clipboard operation
community copied to clipboard

TEP-0121: Refine Retries for TaskRuns and CustomRuns

Open XinruZhang opened this issue 2 years ago • 12 comments

Retries is designed inconsistently between TaskRun and CustomRun, see the Motivation section for details. In this TEP, we propose to align the inconsistency by

  1. solidifying TimeOut is for each retry attempt, updating TEP-069 and docs/runs.md to reflect the alignment.
  2. migrating Retries implementation to TaskRun reconciler.

XinruZhang avatar Sep 08 '22 15:09 XinruZhang

/kind tep

jerop avatar Sep 08 '22 16:09 jerop

/assign

lbernick avatar Sep 08 '22 16:09 lbernick

/assign @pritidesai

lbernick avatar Sep 12 '22 16:09 lbernick

Pushed a new commit comparing related works.

XinruZhang avatar Sep 13 '22 12:09 XinruZhang

I'm moving this TEP to implementable as we've fully discussed the options in Tekton API Working Group, PTAL~

XinruZhang avatar Sep 21 '22 17:09 XinruZhang

  1. We need to document clearly the end condition of a Custom Run.
    • For now, we rely on both Condition and retriesStatus: IsDone() & len(retriesStatus)==retries
      • And it is not aligned with what we documented, because the pr controller thinks the run should be retried when it times out.
    • An alternative: make the decision solely based on Condition.
      • consider removing retriesStatus from run status because we'd like to give customers as much flexibility as possible.
  2. Do we want to change the documentation of v1beta1 or both v1alpha1 and v1beta1?

A piece of background information: the current implementation is different (timeout for per retry) from what documented (timeout for all retry attempts), but is aligned with TaskRun behavior.

The current end condition when a Run fails:

// isFailure returns true only if the run has failed and will not be retried.
// If the PipelineTask has a Matrix, isFailure returns true if any run has failed (no remaining retries)
// and all other runs are done.
func (t ResolvedPipelineTask) isFailure() bool {
	//...
	return isDone && c.IsFalse() && !t.hasRemainingRetries()
}

// IsDone returns true if the Run's status indicates that it is done.
func (r *Run) IsDone() bool {
 return !r.Status.GetCondition(apis.ConditionSucceeded).IsUnknown()
}

We propose to not rely on the retriesStatus as part of the end condition, instead, instruct CustomRun users that DO NOT update run condition to ConditionSucceeded until all retries be exhausted.

@lbernick, @pritidesai , @abayer wdyt?

cc @dibyom

XinruZhang avatar Sep 26 '22 17:09 XinruZhang

thanks Xinru for this summary! I think we should remove status.retriesStatus from v1beta1.CustomRun instead of only changing the docs and PR reconciler code.

lbernick avatar Sep 26 '22 18:09 lbernick

thanks Xinru for this summary! I think we should remove status.retriesStatus from v1beta1.CustomRun instead of only changing the docs and PR reconciler code.

Thanks @lbernick! Right, updated the sumary to reflect the proposal.

XinruZhang avatar Sep 26 '22 18:09 XinruZhang

Thanks, @XinruZhang for working on this. I definitely agree we need some rework in this area. My preferred option would actually be the opposite (see my comment in https://github.com/tektoncd/community/pull/816/files#r985763039) - but of course, I'd be happy to be convinced otherwise.

afrittoli avatar Oct 03 '22 13:10 afrittoli

Hi everyone, I'm moving this to proposed, hope we can check it in and iterate it later on. LMK if you have other concerns.

cc @afrittoli @pritidesai @lbernick @jerop @dibyom

XinruZhang avatar Oct 03 '22 19:10 XinruZhang

thanks Xinru! Since we haven't come to a consensus yet on the way forward, maybe it would be better to just clearly list out the alternatives instead of specifying one as the proposal? (even if it's in a proposed state)

lbernick avatar Oct 04 '22 14:10 lbernick

Thanks @pritidesai @jerop @lbernick for your detailed review, appreciate it a lot! ❤️ I Pushed a new commit to reflect what we've discussed in the comment thread, PTAL :)

XinruZhang avatar Oct 17 '22 18:10 XinruZhang

/hold Talking with @pritidesai offline about some important details of Option 1, related links:

  • https://github.com/tektoncd/pipeline/blob/main/docs/variables.md
  • https://github.com/tektoncd/pipeline/pull/3770

Will update the TEP soon.

XinruZhang avatar Oct 20 '22 20:10 XinruZhang

Hello folks, I am just trying to catch up with all the updates and might have missed some. Do we ensure that a custom task treats that timeout is for each retry and how?

ScrapCodes avatar Oct 26 '22 17:10 ScrapCodes

Hi @ScrapCodes, thanks for the comment! Yes, that's a great question. I don't think we have a way to enforce that, this is more of a contract we build with our customers: i.e. if you don't implement the custom task controller in this way, the pipeline controller may not work as intended.

XinruZhang avatar Oct 26 '22 17:10 XinruZhang

We(me and Prashant) have reached agreement in an offline discussion on last Friday, see the meeting note. Thanks for the great discussion! @ScrapCodes

XinruZhang avatar Oct 31 '22 20:10 XinruZhang

thanks @XinruZhang for all your efforts and patience 🤗

/approve

pritidesai avatar Nov 03 '22 19:11 pritidesai

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jerop, lbernick, pritidesai

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

tekton-robot avatar Nov 03 '22 19:11 tekton-robot

LGTM since have all the approvals now

/lgtm

dibyom avatar Nov 03 '22 19:11 dibyom