Step is skipped but should be retried
Checklist
- [X] I've searched the issue queue to verify this is not a duplicate bug report.
- [X] I've included steps to reproduce the bug.
- [X] I've pasted the output of
kargo version. - [X] I've pasted logs, if applicable.
Description
Steps that return PromotionStepStatusErrored are not retried when if field is empty
Screenshots
Steps to Reproduce
- Create this ClusterPromotionTask
apiVersion: kargo.akuity.io/v1alpha1
kind: ClusterPromotionTask
metadata:
name: test-skipped
spec:
steps:
- as: gitClone
config:
checkout:
- branch: ${{ vars.gitOpsRepoBranch }}
path: ./src
repoURL: ${{ vars.gitOpsRepoURL }}
retry:
errorThreshold: 3
uses: git-clone
vars:
- name: gitOpsRepoURL
value: <NON_EXISTING_REPO_URL>
- name: gitOpsRepoBranch
value: main
- Create a stage that uses this promotion task
apiVersion: kargo.akuity.io/v1alpha1
kind: Stage
metadata:
name: test-skipped-step
spec:
promotionTemplate:
spec:
steps:
- task:
kind: ClusterPromotionTask
name: test-skipped
requestedFreight:
- origin:
kind: Warehouse
name: <EXISTING_WAREHOUSE>
sources:
direct: true
stages: []
- Launch a promotion
The first run of the step will result in this status :
stepExecutionMetadata:
- alias: task-1::gitClone
errorCount: 1
message: 'error running step "task-1::gitClone": error getting credentials for
<REPO_NAME>: error getting installation
access token: error getting installation access token: POST
<API>: 422
There is at least one repository that does not exist or is not
accessible to the parent installation. []; step will be retried'
startedAt: 2025-11-12T15:34:34Z
status: Errored
The second run of the step will result in this status :
stepExecutionMetadata:
- alias: task-1::gitClone
errorCount: 1
message: 'error running step "task-1::gitClone": error getting credentials for
<REPO_NAME>: error getting installation
access token: error getting installation access token: POST
<API>: 422
There is at least one repository that does not exist or is not
accessible to the parent installation. []; step will be retried'
startedAt: 2025-11-12T15:34:34Z
status: Skipped
Investigation
We investigated a bit on how this could happen, we discovered that by adding if: ${{ always() }} the retry behavior was working correctly.
The culprit seems easy to find :
// Skip returns true if the Step should be skipped based on the If condition.
// The If condition is evaluated against the provided Context.
func (s *Step) Skip(
ctx context.Context,
cl client.Client,
cache *gocache.Cache,
promoCtx Context,
) (bool, error) {
// If no "if" condition is provided, then this step is automatically skipped
// if any of the previous steps have errored or failed and is not skipped
// otherwise.
if s.If == "" {
return promoCtx.StepExecutionMetadata.HasFailures(), nil
}
...
The issue for us here is that a failed step will always (Not true for steps that return a Running status) be skipped since the StepExecutionMetadata has failures in its definition, see the first status of the reproduction.
For us the fix would need to be located here :
func (e *simpleEngine) determineStepCompletion(
step Step,
runnerMeta StepRunnerMetadata,
execMeta *kargoapi.StepExecutionMetadata,
err error,
) bool {
...
if err != nil {
// Treat Errored/Failed as if the step is still running so that the
// Promotion will be requeued. The step will be retried on the next
// reconciliation.
-> // Shouldn't we override the execMeta.Status with kargoapi.PromotionStepStatusRunning here ?
execMeta.Message += "; step will be retried"
return false
}
// If we get to here, the step is still Running (waiting for some external
// condition to be met).
execMeta.ErrorCount = 0 // Reset the error count
return false
}
Version
1.8.3
Logs
Tell me if logs are revelant in this case
This issue seems to match the same pattern, we can either keep this one or the other one if you prefer