argo-rollouts
argo-rollouts copied to clipboard
Job Analysis Run never fails when the job image can't be pulled (it just succeeds and promotes).
Checklist:
- [x] I've included steps to reproduce the bug.
- [x] I've included the version of argo rollouts.
Describe the bug
Analysis Run of type job never fails when the job image can't be pulled (it just succeeds and promotes). When Job pods can't pull image it goes into ErrImagePull
indefinitely and the job itself does not Fail. However, Argo rollouts waits for some time and considers this as success and then promotes the canary.
To Reproduce
- create a namespace called
test-namespace
- run the following kubernetes configs
- Wait for the pods to come up
- change the deployment image to
nginx:1.19.0
to trigger a rollout
apiVersion: apps/v1
kind: Deployment
metadata:
name: test-deployment
namespace: test-namespace
spec:
replicas: 0
selector:
matchLabels:
app: nginx
template:
metadata:
labels:
app: nginx
spec:
containers:
- name: nginx
image: nginx:1.9.8 # update it to 1.19.0 to trigger a rollout, as the first time will only create the pods.
ports:
- containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
name: nginx-stable
namespace: test-namespace
spec:
selector:
app: nginx
ports:
- port: 80
targetPort: 8080
---
apiVersion: v1
kind: Service
metadata:
name: nginx-canary
namespace: test-namespace
spec:
selector:
app: nginx
ports:
- port: 80
targetPort: 8080
---
apiVersion: argoproj.io/v1alpha1
kind: ClusterAnalysisTemplate
metadata:
name: acceptance-test
spec:
metrics:
- name: acceptance-test
provider:
job:
spec:
backoffLimit: 2
completions: 1
template:
spec:
containers:
- name: test
image: docker.io/library/lol:5 # this image does not exist.
command: ["bash"]
args: ["-c", "echo 'Hello world! going to exit with 0 (success).' && sleep 10 && exit 0"]
restartPolicy: Never
---
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
name: awesome-rollout
namespace: test-namespace
spec:
replicas: 2
workloadRef:
apiVersion: apps/v1
kind: Deployment
name: test-deployment
strategy:
canary:
canaryService: nginx-canary
stableService: nginx-stable
analysis:
templates:
- templateName: acceptance-test
clusterScope: true
steps:
- setWeight: 20
- pause: {duration: 10s}
- setWeight: 40
- pause: {duration: 10s}
- setWeight: 80
- pause: {duration: 10s}
Expected behavior
- It will succeed and promote and won't report a failure even though the job pod didn't start and was in
ErrImagePull
. It will also delete both the job and the rollout from the cluster.
Screenshots
Version Tested on v1.6.6 and previous versions.
Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.
@zachaller any update on this issue?
@MohammedShetaya According to this issue https://github.com/argoproj/argo-rollouts/issues/2250 some people might prefer the result of the analysis to be inconclusive if there is an image pull error.
Would you agree with this implementation?
@kostis-codefresh It should stay inconclusive then, in our case it considers the inconclusive state as success and proceeds with the rollout. This allows the canary deployment to go unchecked.
Also see discussion at https://github.com/argoproj/argo-rollouts/issues/1329
I think if Jobs supported successConditions you could maybe feed successConditions the output of a few things like pod exits codes and pod errors that could get used to define a success condition. This support would need to be added I think though.
I think if Jobs supported successConditions you could maybe feed successConditions the output of a few things like pod exits codes and pod errors that could get used to define a success condition. This support would need to be added I think though.
Exactly, I tried to do it assuming jobs, like other metrics, support successConditions, but it didn't work.
- Another note on this issue. It is not only when the job can't pull the image. It is a race condition between the job and the rollout. If the job takes much time to finish and the rollout reaches the last step the same behavior will happen (delete the job and succeed)