AnalysisRun succeeds even when Job fails
Checklist:
- [x] I've included steps to reproduce the bug.
- [x] I've included the version of argo rollouts.
Describe the bug
We have set up an AnalysisTemplate that kicks off a Job to determine rollout health. The Job exits with 0 on success and 1 on failure.
- The Job status is reported correctly as "Failed".
- The parent AnalysisRun is not reported correctly: it shows "Successful" instead of "Failed".
- As a result, the rollout continues to progress and becomes "stable".
To Reproduce
Example configuration:
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: container-exit-code-check
spec:
  metrics:
    - name: container-check
      # The 'count' field ensures the metric is evaluated only once.
      count: 1
      # The 'timeout' ensures the job does not run indefinitely.
      timeout: 5m
      failureLimit: 1
      provider:
        job:
          # Define the Job specification to run your container
          spec:
            # Set backoffLimit to 0 to prevent retries for failed jobs.
            # This ensures the Job is marked as failed immediately.
            backoffLimit: 0
            template:
              spec:
                restartPolicy: Never
                containers:
                  - name: check-container
                    image: busybox
                    imagePullPolicy: IfNotPresent
                    # This command is configured to fail for testing purposes.
                    command: ["sh", "-c", "echo 'Running analysis container...'; exit 1;"]
      # The failure condition now checks the aggregated metric results.
      failureCondition: "metricResults[0].failed > 0"
Expected behavior
We expect the AnalysisRun to be marked as "Failed", since its child Job is marked as "Failed".
Screenshots
Version
v1.8.1+1ad2c6a
Logs
time="2025-08-05T16:37:06Z" level=info msg="rollout enqueue due to update event" namespace=test rollout=testapp-rollout
time="2025-08-05T16:37:06Z" level=info msg="Patched: {\"status\":{\"canary\":{\"currentStepAnalysisRunStatus\":{\"status\":\"Running\"}}}}" generation=71 namespace=test resourceVersion=126396616 rollout=testapp-rollout
time="2025-08-05T16:37:06Z" level=info msg="rollout enqueue due to update event" namespace=test rollout=testapp-rollout
time="2025-08-05T16:37:06Z" level=info msg="persisted to informer" generation=71 namespace=test resourceVersion=126396619 rollout=testapp-rollout
time="2025-08-05T16:37:06Z" level=info msg="Reconciliation completed" generation=71 namespace=test resourceVersion=126396616 rollout=testapp-rollout time_ms=24.470204
time="2025-08-05T16:37:06Z" level=info msg="Started syncing rollout" generation=71 namespace=test resourceVersion=126396619 rollout=testapp-rollout
time="2025-08-05T16:37:06Z" level=info msg="Found 1 TrafficRouting Reconcilers" namespace=test rollout=testapp-rollout
time="2025-08-05T16:37:06Z" level=info msg="Reconciling TrafficRouting with type 'Nginx'" namespace=test rollout=testapp-rollout
time="2025-08-05T16:37:06Z" level=info msg="No changes to canary ingress - skipping patch" ingress=testapp-rollout-app-qa-canary namespace=test rollout=testapp-rollout
time="2025-08-05T16:37:06Z" level=info msg="No changes to canary ingress - skipping patch" ingress=testapp-rollout-app-qa-gslb-canary namespace=test rollout=testapp-rollout
time="2025-08-05T16:37:06Z" level=info msg="Reconciling analysis step (stepIndex: 2)" namespace=test rollout=testapp-rollout
time="2025-08-05T16:37:06Z" level=info msg="No status changes. Skipping patch" generation=71 namespace=test resourceVersion=126396619 rollout=testapp-rollout
time="2025-08-05T16:37:06Z" level=info msg="Reconciliation completed" generation=71 namespace=test resourceVersion=126396619 rollout=testapp-rollout time_ms=3.630396
time="2025-08-05T16:37:25Z" level=info msg="Started syncing rollout" generation=71 namespace=test resourceVersion=126396619 rollout=testapp-rollout
time="2025-08-05T16:37:25Z" level=info msg="Found 1 TrafficRouting Reconcilers" namespace=test rollout=testapp-rollout
time="2025-08-05T16:37:25Z" level=info msg="Reconciling TrafficRouting with type 'Nginx'" namespace=test rollout=testapp-rollout
time="2025-08-05T16:37:25Z" level=info msg="No changes to canary ingress - skipping patch" ingress=testapp-rollout-app-qa-canary namespace=test rollout=testapp-rollout
time="2025-08-05T16:37:25Z" level=info msg="No changes to canary ingress - skipping patch" ingress=testapp-rollout-app-qa-gslb-canary namespace=test rollout=testapp-rollout
time="2025-08-05T16:37:25Z" level=info msg="Reconciling analysis step (stepIndex: 2)" namespace=test rollout=testapp-rollout
time="2025-08-05T16:37:25Z" level=info msg="Step Analysis Run 'testapp-rollout-954566f4d-34-2' Status New: 'Successful' Previous: 'Running'" event_reason=AnalysisRunSuccessful namespace=test rollout=testapp-rollout
time="2025-08-05T16:37:25Z" level=info msg="Rollout step 3/5 completed (analysis)" event_reason=RolloutStepCompleted namespace=test rollout=testapp-rollout
time="2025-08-05T16:37:25Z" level=info msg="Patched: {\"status\":{\"canary\":{\"currentStepAnalysisRunStatus\":null},\"conditions\":[{\"lastTransitionTime\":\"2025-07-21T10:52:44Z\",\"lastUpdateTime\":\"2025-07-21T10:52:44Z\",\"message\":\"Rollout has minimum availability\",\"reason\":\"AvailableReason\",\"status\":\"True\",\"type\":\"Available\"},{\"lastTransitionTime\":\"2025-08-05T16:36:40Z\",\"lastUpdateTime\":\"2025-08-05T16:36:40Z\",\"message\":\"Rollout is not healthy\",\"reason\":\"RolloutHealthy\",\"status\":\"False\",\"type\":\"Healthy\"},{\"lastTransitionTime\":\"2025-08-05T16:36:40Z\",\"lastUpdateTime\":\"2025-08-05T16:36:40Z\",\"message\":\"RolloutCompleted\",\"reason\":\"RolloutCompleted\",\"status\":\"False\",\"type\":\"Completed\"},{\"lastTransitionTime\":\"2025-08-05T16:37:06Z\",\"lastUpdateTime\":\"2025-08-05T16:37:06Z\",\"message\":\"Rollout is paused\",\"reason\":\"RolloutPaused\",\"status\":\"False\",\"type\":\"Paused\"},{\"lastTransitionTime\":\"2025-08-05T16:37:06Z\",\"lastUpdateTime\":\"2025-08-05T16:37:25Z\",\"message\":\"ReplicaSet \\\"testapp-rollout-954566f4d\\\" is progressing.\",\"reason\":\"ReplicaSetUpdated\",\"status\":\"True\",\"type\":\"Progressing\"}],\"currentStepIndex\":3}}" generation=71 namespace=test resourceVersion=126396619 rollout=testapp-rollout
time="2025-08-05T16:37:25Z" level=info msg="rollout enqueue due to update event" namespace=test rollout=testapp-rollout
time="2025-08-05T16:37:25Z" level=info msg="persisted to informer" generation=71 namespace=test resourceVersion=126396852 rollout=testapp-rollout
time="2025-08-05T16:37:25Z" level=info msg="Reconciliation completed" generation=71 namespace=test resourceVersion=126396619 rollout=testapp-rollout time_ms=25.30153
time="2025-08-05T16:37:25Z" level=info msg="Started syncing rollout" generation=71 namespace=test resourceVersion=126396852 rollout=testapp-rollout
time="2025-08-05T16:37:25Z" level=info msg="Found 1 TrafficRouting Reconcilers" namespace=test rollout=testapp-rollout
time="2025-08-05T16:37:25Z" level=info msg="Reconciling TrafficRouting with type 'Nginx'" namespace=test rollout=testapp-rollout
time="2025-08-05T16:37:25Z" level=info msg="updating canary Ingress" desiredWeight=50 ingress=testapp-rollout-app-qa-canary namespace=test rollout=testapp-rollout
time="2025-08-05T16:37:25Z" level=info msg="Updating Ingress `testapp-rollout-app-qa-canary` to desiredWeight '50'" event_reason=PatchingCanaryIngress namespace=test rollout=testapp-rollout
time="2025-08-05T16:37:25Z" level=info msg="updating canary Ingress" desiredWeight=50 ingress=testapp-rollout-app-qa-gslb-canary namespace=test rollout=testapp-rollout
time="2025-08-05T16:37:25Z" level=info msg="Updating Ingress `testapp-rollout-app-qa-gslb-canary` to desiredWeight '50'" event_reason=PatchingCanaryIngress namespace=test rollout=testapp-rollout
time="2025-08-05T16:37:26Z" level=info msg="Previous weights: &TrafficWeights{Canary:WeightDestination{Weight:0,ServiceName:app-qa-canary,PodTemplateHash:954566f4d,},Stable:WeightDestination{Weight:100,ServiceName:app-qa-stable,PodTemplateHash:58875b6bf5,},Additional:[]WeightDestination{},Verified:nil,}" namespace=test rollout=testapp-rollout
time="2025-08-05T16:37:26Z" level=info msg="New weights: &TrafficWeights{Canary:WeightDestination{Weight:50,ServiceName:app-qa-canary,PodTemplateHash:954566f4d,},Stable:WeightDestination{Weight:50,ServiceName:app-qa-stable,PodTemplateHash:58875b6bf5,},Additional:[]WeightDestination{},Verified:nil,}" namespace=test rollout=testapp-rollout
time="2025-08-05T16:37:26Z" level=info msg="Traffic weight updated from 0 to 50" event_reason=TrafficWeightUpdated namespace=test rollout=testapp-rollout
time="2025-08-05T16:37:26Z" level=info msg="Rollout step 4/5 completed (setWeight: 50)" event_reason=RolloutStepCompleted namespace=test rollout=testapp-rollout
time="2025-08-05T16:37:26Z" level=info msg="Patched: {\"status\":{\"canary\":{\"weights\":{\"canary\":{\"weight\":50},\"stable\":{\"weight\":50}}},\"conditions\":[{\"lastTransitionTime\":\"2025-07-21T10:52:44Z\",\"lastUpdateTime\":\"2025-07-21T10:52:44Z\",\"message\":\"Rollout has minimum availability\",\"reason\":\"AvailableReason\",\"status\":\"True\",\"type\":\"Available\"},{\"lastTransitionTime\":\"2025-08-05T16:36:40Z\",\"lastUpdateTime\":\"2025-08-05T16:36:40Z\",\"message\":\"Rollout is not healthy\",\"reason\":\"RolloutHealthy\",\"status\":\"False\",\"type\":\"Healthy\"},{\"lastTransitionTime\":\"2025-08-05T16:36:40Z\",\"lastUpdateTime\":\"2025-08-05T16:36:40Z\",\"message\":\"RolloutCompleted\",\"reason\":\"RolloutCompleted\",\"status\":\"False\",\"type\":\"Completed\"},{\"lastTransitionTime\":\"2025-08-05T16:37:06Z\",\"lastUpdateTime\":\"2025-08-05T16:37:06Z\",\"message\":\"Rollout is paused\",\"reason\":\"RolloutPaused\",\"status\":\"False\",\"type\":\"Paused\"},{\"lastTransitionTime\":\"2025-08-05T16:37:06Z\",\"lastUpdateTime\":\"2025-08-05T16:37:26Z\",\"message\":\"ReplicaSet \\\"testapp-rollout-954566f4d\\\" is progressing.\",\"reason\":\"ReplicaSetUpdated\",\"status\":\"True\",\"type\":\"Progressing\"}],\"currentStepIndex\":4}}" generation=71 namespace=test resourceVersion=126396852 rollout=testapp-rollout
time="2025-08-05T16:37:26Z" level=info msg="persisted to informer" generation=71 namespace=test resourceVersion=126396889 rollout=testapp-rollout
time="2025-08-05T16:37:26Z" level=info msg="Reconciliation completed" generation=71 namespace=test resourceVersion=126396852 rollout=testapp-rollout time_ms=208.38198699999998
time="2025-08-05T16:37:26Z" level=info msg="Started syncing rollout" generation=71 namespace=test resourceVersion=126396889 rollout=testapp-rollout
time="2025-08-05T16:37:26Z" level=info msg="rollout enqueue due to update event" namespace=test rollout=testapp-rollout
time="2025-08-05T16:37:26Z" level=info msg="Found 1 TrafficRouting Reconcilers" namespace=test rollout=testapp-rollout
time="2025-08-05T16:37:26Z" level=info msg="Reconciling TrafficRouting with type 'Nginx'" namespace=test rollout=testapp-rollout
time="2025-08-05T16:37:26Z" level=info msg="No changes to canary ingress - skipping patch" ingress=testapp-rollout-app-qa-canary namespace=test rollout=testapp-rollout
time="2025-08-05T16:37:26Z" level=info msg="No changes to canary ingress - skipping patch" ingress=testapp-rollout-app-qa-gslb-canary namespace=test rollout=testapp-rollout
time="2025-08-05T16:37:26Z" level=info msg="Reconciling canary pause step (stepIndex: 4/5)" namespace=test rollout=testapp-rollout
time="2025-08-05T16:37:26Z" level=info msg="Not finished reconciling Canary Pause" namespace=test rollout=testapp-rollout
time="2025-08-05T16:37:26Z" level=info msg="Adding pause reason CanaryPauseStep with start time 2025-08-05T16:37:26Z" namespace=test rollout=testapp-rollout
time="2025-08-05T16:37:26Z" level=info msg="Rollout is paused (CanaryPauseStep)" event_reason=RolloutPaused namespace=test rollout=testapp-rollout
time="2025-08-05T16:37:26Z" level=info msg="Patched: {\"status\":{\"controllerPause\":true,\"message\":\"CanaryPauseStep\",\"pauseConditions\":[{\"reason\":\"CanaryPauseStep\",\"startTime\":\"2025-08-05T16:37:26Z\"}],\"phase\":\"Paused\"}}" generation=71 namespace=test resourceVersion=126396889 rollout=testapp-rollout
time="2025-08-05T16:37:26Z" level=info msg="rollout enqueue due to update event" namespace=test rollout=testapp-rollout
time="2025-08-05T16:37:26Z" level=info msg="persisted to informer" generation=71 namespace=test resourceVersion=126396890 rollout=testapp-rollout
time="2025-08-05T16:37:26Z" level=info msg="Reconciliation completed" generation=71 namespace=test resourceVersion=126396889 rollout=testapp-rollout time_ms=55.364008
time="2025-08-05T16:37:26Z" level=info msg="Started syncing rollout" generation=71 namespace=test resourceVersion=126396890 rollout=testapp-rollout
time="2025-08-05T16:37:26Z" level=info msg="Patched conditions: {\"status\":{\"conditions\":[{\"lastTransitionTime\":\"2025-07-21T10:52:44Z\",\"lastUpdateTime\":\"2025-07-21T10:52:44Z\",\"message\":\"Rollout has minimum availability\",\"reason\":\"AvailableReason\",\"status\":\"True\",\"type\":\"Available\"},{\"lastTransitionTime\":\"2025-08-05T16:36:40Z\",\"lastUpdateTime\":\"2025-08-05T16:36:40Z\",\"message\":\"Rollout is not healthy\",\"reason\":\"RolloutHealthy\",\"status\":\"False\",\"type\":\"Healthy\"},{\"lastTransitionTime\":\"2025-08-05T16:36:40Z\",\"lastUpdateTime\":\"2025-08-05T16:36:40Z\",\"message\":\"RolloutCompleted\",\"reason\":\"RolloutCompleted\",\"status\":\"False\",\"type\":\"Completed\"},{\"lastTransitionTime\":\"2025-08-05T16:37:26Z\",\"lastUpdateTime\":\"2025-08-05T16:37:26Z\",\"message\":\"Rollout is paused\",\"reason\":\"RolloutPaused\",\"status\":\"Unknown\",\"type\":\"Progressing\"},{\"lastTransitionTime\":\"2025-08-05T16:37:26Z\",\"lastUpdateTime\":\"2025-08-05T16:37:26Z\",\"message\":\"Rollout is paused\",\"reason\":\"RolloutPaused\",\"status\":\"True\",\"type\":\"Paused\"}]}}" generation=71 namespace=test resourceVersion=126396890 rollout=testapp-rollout
time="2025-08-05T16:37:26Z" level=info msg="Found 1 TrafficRouting Reconcilers" namespace=test rollout=testapp-rollout
time="2025-08-05T16:37:26Z" level=info msg="Reconciling TrafficRouting with type 'Nginx'" namespace=test rollout=testapp-rollout
time="2025-08-05T16:37:26Z" level=info msg="No changes to canary ingress - skipping patch" ingress=testapp-rollout-app-qa-canary namespace=test rollout=testapp-rollout
time="2025-08-05T16:37:26Z" level=info msg="No changes to canary ingress - skipping patch" ingress=testapp-rollout-app-qa-gslb-canary namespace=test rollout=testapp-rollout
time="2025-08-05T16:37:26Z" level=info msg="Reconciling canary pause step (stepIndex: 4/5)" namespace=test rollout=testapp-rollout
time="2025-08-05T16:37:26Z" level=info msg="Enqueueing Rollout in 14.913946381s seconds" namespace=test rollout=testapp-rollout
time="2025-08-05T16:37:26Z" level=info msg="rollout enqueue during wait" namespace=test rollout=testapp-rollout
time="2025-08-05T16:37:26Z" level=info msg="Not finished reconciling Canary Pause" namespace=test rollout=testapp-rollout
time="2025-08-05T16:37:26Z" level=info msg="No status changes. Skipping patch" generation=71 namespace=test resourceVersion=126396890 rollout=testapp-rollout
time="2025-08-05T16:37:26Z" level=info msg="rollout enqueue due to update event" namespace=test rollout=testapp-rollout
time="2025-08-05T16:37:26Z" level=info msg="persisted to informer" generation=71 namespace=test resourceVersion=126396892 rollout=testapp-rollout
time="2025-08-05T16:37:26Z" level=info msg="Reconciliation completed" generation=71 namespace=test resourceVersion=126396890 rollout=testapp-rollout time_ms=21.430397
time="2025-08-05T16:37:26Z" level=info msg="Started syncing rollout" generation=71 namespace=test resourceVersion=126396892 rollout=testapp-rollout
time="2025-08-05T16:37:26Z" level=info msg="Found 1 TrafficRouting Reconcilers" namespace=test rollout=testapp-rollout
time="2025-08-05T16:37:26Z" level=info msg="Reconciling TrafficRouting with type 'Nginx'" namespace=test rollout=testapp-rollout
time="2025-08-05T16:37:26Z" level=info msg="No changes to canary ingress - skipping patch" ingress=testapp-rollout-app-qa-canary namespace=test rollout=testapp-rollout
time="2025-08-05T16:37:26Z" level=info msg="No changes to canary ingress - skipping patch" ingress=testapp-rollout-app-qa-gslb-canary namespace=test rollout=testapp-rollout
time="2025-08-05T16:37:26Z" level=info msg="Reconciling canary pause step (stepIndex: 4/5)" namespace=test rollout=testapp-rollout
time="2025-08-05T16:37:26Z" level=info msg="Enqueueing Rollout in 14.904179881s seconds" namespace=test rollout=testapp-rollout
time="2025-08-05T16:37:26Z" level=info msg="rollout enqueue during wait" namespace=test rollout=testapp-rollout
time="2025-08-05T16:37:26Z" level=info msg="Not finished reconciling Canary Pause" namespace=test rollout=testapp-rollout
time="2025-08-05T16:37:26Z" level=info msg="No status changes. Skipping patch" generation=71 namespace=test resourceVersion=126396892 rollout=testapp-rollout
time="2025-08-05T16:37:26Z" level=info msg="Reconciliation completed" generation=71 namespace=test resourceVersion=126396892 rollout=testapp-rollout time_ms=4.622091
Message from the maintainers:
Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.
I am facing exactly the same issue, so I am interested in a fix.
Well, I think this happens because failureLimit is set to 1, and we only set AnalysisPhaseFailed when result.Failed > failureLimit. In the above AnalysisTemplate, count is 1, so the Job runs once and result.Failed is at most 1. Since 1 > 1 is false, the condition is never satisfied and the run is not marked as Failed.
https://github.com/argoproj/argo-rollouts/blob/67d19e100f2408a8be90531388b2452be8d56eff/analysis/analysis.go#L667-L670
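To make the comparison concrete, here is a minimal Go sketch of the strict "greater than" check described above. It is not the controller's actual code: `assessPhase` is a hypothetical helper, the counts are plain ints, and every other phase the real controller can report is left out. It only shows why one failed measurement behaves differently under failureLimit 0 and failureLimit 1.

```go
package main

import "fmt"

// assessPhase is a hypothetical, simplified stand-in for the linked check.
// A metric is treated as failed only when the number of failed measurements
// strictly exceeds failureLimit.
func assessPhase(failed, failureLimit int) string {
	if failed > failureLimit {
		return "Failed"
	}
	return "Successful"
}

func main() {
	// With count: 1 the Job runs once, so at most one failure is recorded.
	failed := 1
	for _, limit := range []int{0, 1} {
		fmt.Printf("failureLimit=%d failed=%d -> %s\n", limit, failed, assessPhase(failed, limit))
	}
}
```

Under these assumptions, failureLimit=0 yields "Failed" and failureLimit=1 yields "Successful" for the same single failed Job.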
Setting the failureLimit to 0 in the above AnalysisTemplate correctly marks the AnalysisRun as Failed when the Job fails, whereas setting it to 1 causes the AnalysisRun to be marked as Successful on Job failure.
NAME                                                               KIND         STATUS        AGE  INFO
⟳ rollouts-demo                                                    Rollout      ✖ Degraded    43h
├──# revision:13
│  ├──⧉ rollouts-demo-687d76d795                                   ReplicaSet   • ScaledDown  43h  canary
│  └──α rollouts-demo-687d76d795-13-2                              AnalysisRun  ✖ Failed      10m  ✖ 1
│     └──⊞ a204647d-e379-49d9-8a29-7abd15e1feb4.container-check.1  Job          ✖ Failed      10m
├──# revision:12
│  ├──⧉ rollouts-demo-6cf78c66c5                                   ReplicaSet   ✔ Healthy     66m  stable
│  │  ├──□ rollouts-demo-6cf78c66c5-88ml8                          Pod          ✔ Running     20m  ready:1/1
│  │  ├──□ rollouts-demo-6cf78c66c5-qrhbx                          Pod          ✔ Running     19m  ready:1/1
│  │  ├──□ rollouts-demo-6cf78c66c5-899br                          Pod          ✔ Running     11m  ready:1/1
│  │  ├──□ rollouts-demo-6cf78c66c5-xzdjt                          Pod          ✔ Running     11m  ready:1/1
│  │  └──□ rollouts-demo-6cf78c66c5-xgwqz                          Pod          ✔ Running     10m  ready:1/1
│  └──α rollouts-demo-6cf78c66c5-12-2                              AnalysisRun  ✔ Successful  19m  ✖ 1
│     └──⊞ 641b93fd-b8e5-40b4-bb24-9ca927724521.container-check.1  Job          ✖ Failed      19m
@krapie Thank you for the help. Verified that setting failureLimit to 0 works as expected.