kargo icon indicating copy to clipboard operation
kargo copied to clipboard

Step is skipped but should be retried

Open AlanLonguet opened this issue 1 month ago • 1 comments

Checklist

  • [X] I've searched the issue queue to verify this is not a duplicate bug report.
  • [X] I've included steps to reproduce the bug.
  • [X] I've pasted the output of kargo version.
  • [X] I've pasted logs, if applicable.

Description

Steps that return PromotionStepStatusErrored are not retried when if field is empty

Screenshots

Steps to Reproduce

  • Create this ClusterPromotionTask
apiVersion: kargo.akuity.io/v1alpha1
kind: ClusterPromotionTask
metadata:
  name: test-skipped
spec:
  steps:
    - as: gitClone
      config:
        checkout:
          - branch: ${{ vars.gitOpsRepoBranch }}
             path: ./src
        repoURL: ${{ vars.gitOpsRepoURL }}
      retry:
        errorThreshold: 3
      uses: git-clone
  vars:
    - name: gitOpsRepoURL
      value: <NON_EXISTING_REPO_URL> 
    - name: gitOpsRepoBranch
      value: main
  • Create a stage that uses this promotion task
apiVersion: kargo.akuity.io/v1alpha1
kind: Stage
metadata:
  name: test-skipped-step
spec:
  promotionTemplate:
    spec:
      steps:
        - task:
            kind: ClusterPromotionTask
            name: test-skipped
  requestedFreight:
    - origin:
        kind: Warehouse
        name: <EXISTING_WAREHOUSE>
      sources:
        direct: true
        stages: []
  • Launch a promotion

The first run of the step will result in this status :

  stepExecutionMetadata:
    - alias: task-1::gitClone
      errorCount: 1
      message: 'error running step "task-1::gitClone": error getting credentials for
        <REPO_NAME>: error getting installation
        access token: error getting installation access token: POST
        <API>: 422
        There is at least one repository that does not exist or is not
        accessible to the parent installation. []; step will be retried'
      startedAt: 2025-11-12T15:34:34Z
      status: Errored

The second run of the step will result in this status :

  stepExecutionMetadata:
    - alias: task-1::gitClone
      errorCount: 1
      message: 'error running step "task-1::gitClone": error getting credentials for
        <REPO_NAME>: error getting installation
        access token: error getting installation access token: POST
        <API>: 422
        There is at least one repository that does not exist or is not
        accessible to the parent installation. []; step will be retried'
      startedAt: 2025-11-12T15:34:34Z
      status: Skipped

Investigation

We investigated a bit on how this could happen, we discovered that by adding if: ${{ always() }} the retry behavior was working correctly.

The culprit seems easy to find :

// Skip returns true if the Step should be skipped based on the If condition.
// The If condition is evaluated against the provided Context.
func (s *Step) Skip(
	ctx context.Context,
	cl client.Client,
	cache *gocache.Cache,
	promoCtx Context,
) (bool, error) {
	// If no "if" condition is provided, then this step is automatically skipped
	// if any of the previous steps have errored or failed and is not skipped
	// otherwise.
	if s.If == "" {
		return promoCtx.StepExecutionMetadata.HasFailures(), nil
	}
...

The issue for us here is that a failed step will always (Not true for steps that return a Running status) be skipped since the StepExecutionMetadata has failures in its definition, see the first status of the reproduction.

For us the fix would need to be located here :

func (e *simpleEngine) determineStepCompletion(
	step Step,
	runnerMeta StepRunnerMetadata,
	execMeta *kargoapi.StepExecutionMetadata,
	err error,
) bool {
...
	if err != nil {
		// Treat Errored/Failed as if the step is still running so that the
		// Promotion will be requeued. The step will be retried on the next
		// reconciliation.
                
->      // Shouldn't we override the execMeta.Status with kargoapi.PromotionStepStatusRunning here ?
		execMeta.Message += "; step will be retried"
		return false
	}

	// If we get to here, the step is still Running (waiting for some external
	// condition to be met).
	execMeta.ErrorCount = 0 // Reset the error count
	return false
}

Version

1.8.3

Logs

Tell me if logs are revelant in this case

AlanLonguet avatar Nov 12 '25 15:11 AlanLonguet

This issue seems to match the same pattern, we can either keep this one or the other one if you prefer

AlanLonguet avatar Nov 12 '25 16:11 AlanLonguet