argo-workflows
A bug encountered when executing examples from workflow-templates - 'A' errored: expected < 2 pods, got 2 - this is a bug
Checklist
- [x] Double-checked my configuration.
- [x] Tested using the latest version.
- [x] Used the Emissary executor.
Summary
I created the templates using
argo template create -n argo templates.yaml
from examples/workflow-templates.
Then I submitted dag.yaml using
argo submit -n argo --watch dag.yaml
What version are you running? argo: v3.3.8
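For completeness, the CLI version alone does not pin down what the controller is running; a quick way to check both (a sketch, assuming the default argo namespace and a Deployment named workflow-controller):
# CLI and server version as reported by argo
argo version
# Image of the workflow controller that is actually deployed
kubectl -n argo get deploy workflow-controller -o jsonpath='{.spec.template.spec.containers[0].image}'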
Diagnostics
Paste the smallest workflow that reproduces the bug. We must be able to run the workflow.
templates.yaml
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: workflow-template-whalesay-template
spec:
  entrypoint: whalesay-template
  templates:
  - name: whalesay-template
    inputs:
      parameters:
      - name: message
    container:
      image: docker/whalesay
      command: [cowsay]
      args: ["{{inputs.parameters.message}}"]
---
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: workflow-template-random-fail-template
spec:
  templates:
  - name: random-fail-template
    retryStrategy:
      limit: 10
    container:
      image: python:alpine3.6
      command: [python, -c]
      # fail with a 66% probability
      args: ["import random; import sys; exit_code = random.choice([0, 1, 1]); sys.exit(exit_code)"]
---
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: workflow-template-inner-steps
spec:
  templates:
  - name: whalesay-template
    inputs:
      parameters:
      - name: message
    container:
      image: docker/whalesay
      command: [cowsay]
      args: ["{{inputs.parameters.message}}"]
  - name: inner-steps
    steps:
    - - name: inner-hello1
        templateRef:
          name: workflow-template-whalesay-template
          template: whalesay-template
        arguments:
          parameters:
          - name: message
            value: "inner-hello1"
    - - name: inner-hello2a
        templateRef:
          name: workflow-template-whalesay-template
          template: whalesay-template
        arguments:
          parameters:
          - name: message
            value: "inner-hello2a"
      - name: inner-hello2b
        template: whalesay-template
        arguments:
          parameters:
          - name: message
            value: "inner-hello2b"
---
# The following workflow executes a diamond workflow
#
#   A
#  / \
# B   C
#  \ /
#   D
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: workflow-template-inner-dag
spec:
  templates:
  - name: whalesay-template
    inputs:
      parameters:
      - name: message
    container:
      image: docker/whalesay
      command: [cowsay]
      args: ["{{inputs.parameters.message}}"]
  - name: inner-diamond
    dag:
      tasks:
      - name: inner-A
        templateRef:
          name: workflow-template-whalesay-template
          template: whalesay-template
        arguments:
          parameters:
          - name: message
            value: inner-A
      - name: inner-B
        depends: "inner-A"
        template: whalesay-template
        arguments:
          parameters:
          - name: message
            value: inner-B
      - name: inner-C
        depends: "inner-A"
        template: whalesay-template
        arguments:
          parameters:
          - name: message
            value: inner-C
      - name: inner-D
        depends: "inner-B && inner-C"
        templateRef:
          name: workflow-template-whalesay-template
          template: whalesay-template
        arguments:
          parameters:
          - name: message
            value: inner-D
---
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: workflow-template-submittable
spec:
  entrypoint: whalesay-template
  arguments:
    parameters:
    - name: message
      value: hello world
  templates:
  - name: whalesay-template
    inputs:
      parameters:
      - name: message
    container:
      image: docker/whalesay
      command: [cowsay]
      args: ["{{inputs.parameters.message}}"]
dag.yaml
# The following workflow executes a diamond workflow
#
#   A
#  / \
# B   C
#  \ /
#   D
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: workflow-template-dag-diamond-
spec:
  entrypoint: diamond
  templates:
  - name: diamond
    dag:
      tasks:
      - name: A
        templateRef:
          name: workflow-template-whalesay-template
          template: whalesay-template
        arguments:
          parameters:
          - name: message
            value: A
      - name: B
        depends: "A"
        templateRef:
          name: workflow-template-whalesay-template
          template: whalesay-template
        arguments:
          parameters:
          - name: message
            value: B
      - name: C
        depends: "A"
        templateRef:
          name: workflow-template-inner-dag
          template: inner-diamond
      - name: D
        depends: "B && C"
        templateRef:
          name: workflow-template-whalesay-template
          template: whalesay-template
        arguments:
          parameters:
          - name: message
            value: D
Message from the maintainers:
Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.
@MISSEY it looks like there are two pods for task A. Can you check the workflow pods in your namespace? Please provide the ran Workflow YAML (kubectl get wf <workflowname>) and the workflow pod YAMLs.
Can you also attach the workflow controller logs?
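A minimal sketch of how the requested diagnostics can be collected, assuming the default install in the argo namespace and a controller Deployment named workflow-controller (replace <workflow-name> with the actual Workflow name):
# The ran Workflow resource
kubectl get wf <workflow-name> -n argo -o yaml
# All pods created for that Workflow (they carry the workflows.argoproj.io/workflow label)
kubectl get pods -n argo -l workflows.argoproj.io/workflow=<workflow-name> -o yaml
# Workflow controller logs
kubectl logs -n argo deploy/workflow-controller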
@sarabala1979
I am sorry for the late response; I was not well in the past week.
I am now running into different problems related to the same issue.
I prepared a workflow called ml-pipeline that uses a DAG with two WorkflowTemplates, update-dataset and train-classifier.
I have tried to explain the problems through two experiments.
1st experiment
pipeline.yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: train-model
spec:
  entrypoint: ml-pipeline
  imagePullSecrets:
  - name: ml-infrastructure-git-registry  # secret setup in kubernetes cluster
  volumes:
  - name: workdir  # name of the volume to be mounted in Workflow template
    persistentVolumeClaim:
      claimName: pvc-ml-infrastructure-datasets
  - name: saurabh-ssh-key
    secret:
      secretName: saurabh-ssh-key
      defaultMode: 0400
  templates:
  - name: ml-pipeline
    dag:
      failFast: false
      tasks:
      - name: update-dataset
        templateRef:
          name: update-dataset
          template: update-dataset
        arguments:
          parameters:
          - name: message
            value: update-dataset
      - name: train-classifier
        depends: update-dataset
        templateRef:
          name: train-classifier
          template: train-classifier
        arguments:
          parameters:
          - name: message
            value: train-classifier
update-dataset.yaml
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: update-dataset
spec:
  templates:
  - name: update-dataset
    container:
      image: git.ni.dfki.de:5050/ml_infrastructure/argo-workflow:ml_pipeline_0.0.3_classifier
      command: [ sh, -c ]
      args: [ "ls -la /mnt/datasets && env | grep AWS && python /code/update_dataset.py --gitrepo /mnt/datasets -b main -d hymenoptera_data" ]
      env:
      - name: AWS_ACCESS_KEY_ID  # for dvc
        value: # Minio credential for DVC
      - name: AWS_SECRET_ACCESS_KEY
        value: # service key for minio account
      volumeMounts:
      - name: workdir
        mountPath: /mnt/datasets
      - name: saurabh-ssh-key
        mountPath: "/root/.ssh"
train-classifier.yaml
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: train-classifier
spec:
  templates:
  - name: train-classifier
    inputs:
      parameters:
      - name: message
    container:
      image: git.ni.dfki.de:5050/ml_infrastructure/argo-workflow:ml_pipeline_0.0.3_classifier
      command: [ sh, -c ]
      args: [ "echo {{inputs.parameters.message}} && python /code/cnn_classifier.py --datasetpath /mnt/datasets --model cnn_classifier --config cnn_classifier.yaml --datasetname hymenoptera_data" ]
      volumeMounts:
      - name: workdir
        mountPath: /mnt/datasets
      - name: saurabh-ssh-key
        mountPath: "/root/.ssh"
If I run both tasks in the DAG, there is an error on the command line:
STEP TEMPLATE PODNAME DURATION MESSAGE
⚠ train-modelwfjjp ml-pipeline
├─⚠ update-dataset update-dataset/update-dataset train-modelwfjjp-update-dataset/update-dataset-1823868785 10s task 'train-modelwfjjp.update-dataset' errored: expected < 2 pods, got 2 - this is a bug
and on the GUI: although there was no error in the update-dataset task, the GUI shows it as failed.
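The "expected < 2 pods, got 2" message means the controller matched two pods to the same workflow node. A sketch of how to check this directly (the label and annotation names are taken from the pod description below; <pod-name> is a placeholder):
# Every pod the controller associates with this workflow
kubectl get pods -n argo -l workflows.argoproj.io/workflow=train-modelwfjjp
# The node each pod belongs to is recorded in the workflows.argoproj.io/node-id annotation
kubectl describe pod <pod-name> -n argo | grep node-id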
$ kubectl describe pod train-modelwfjjp-update-dataset-240299978 -n argo
I0804 14:57:36.904530 195139 request.go:655] Throttling request took 1.17693408s, request: GET:https://lb.kubesphere.local:6443/apis/notification.kubesphere.io/v2beta1?timeout=32s
Name: train-modelwfjjp-update-dataset-240299978
Namespace: argo
Priority: 0
Node: agrigaia-ws3-u/10.249.3.13
Start Time: Thu, 04 Aug 2022 14:39:31 +0200
Labels: workflows.argoproj.io/completed=true
workflows.argoproj.io/workflow=train-modelwfjjp
Annotations: cni.projectcalico.org/containerID: d3a26f4470acacc62017e8e6a80e0ad199fc211eeb76452830dc20e74ecadc0c
cni.projectcalico.org/podIP:
cni.projectcalico.org/podIPs:
kubectl.kubernetes.io/default-container: main
workflows.argoproj.io/node-id: train-modelwfjjp-240299978
workflows.argoproj.io/node-name: train-modelwfjjp.update-dataset
Status: Succeeded
IP: 10.233.96.88
IPs:
IP: 10.233.96.88
Controlled By: Workflow/train-modelwfjjp
Init Containers:
init:
Container ID: docker://bd3955ffabeebec24e35ea85e7cfd20f20a2fd4881a9bf68b648c22a5df11b48
Image: quay.io/argoproj/argoexec:latest
Image ID: docker-pullable://quay.io/argoproj/argoexec@sha256:0bc809b17019ac2c21a5e0cfc04715027bf8948a4d9e9632b48a9be42a1343b7
Port: <none>
Host Port: <none>
Command:
argoexec
init
--loglevel
info
--log-format
text
State: Terminated
Reason: Completed
Exit Code: 0
Started: Thu, 04 Aug 2022 14:39:34 +0200
Finished: Thu, 04 Aug 2022 14:39:34 +0200
Ready: True
Restart Count: 0
Requests:
cpu: 10m
memory: 64Mi
Environment:
ARGO_POD_NAME: train-modelwfjjp-update-dataset-240299978 (v1:metadata.name)
ARGO_POD_UID: (v1:metadata.uid)
GODEBUG: x509ignoreCN=0
ARGO_WORKFLOW_NAME: train-modelwfjjp
ARGO_CONTAINER_NAME: init
ARGO_TEMPLATE: {"name":"update-dataset","inputs":{},"outputs":{},"metadata":{},"container":{"name":"","image":"git.ni.dfki.de:5050/ml_infrastructure/argo-workflow:ml_pipeline_0.0.3_classifier","command":["sh","-c"],"args":["ls -la /mnt/datasets \u0026\u0026 env | grep AWS \u0026\u0026 python /code/update_dataset.py --gitrepo /mnt/datasets -b main -d hymenoptera_data"],"env":[{"name":"AWS_ACCESS_KEY_ID","value":"HK5KFDD6SPP2BAPDP6HX"},{"name":"AWS_SECRET_ACCESS_KEY","value":"iMaNleYB+oKEMILUHD6KLoQzClk8hlD+xldzOVGC"}],"resources":{},"volumeMounts":[{"name":"workdir","mountPath":"/mnt/datasets"},{"name":"saurabh-ssh-key","mountPath":"/root/.ssh"}]},"archiveLocation":{"archiveLogs":true,"s3":{"endpoint":"minio:9000","bucket":"my-bucket","insecure":true,"accessKeySecret":{"name":"my-minio-cred","key":"accesskey"},"secretKeySecret":{"name":"my-minio-cred","key":"secretkey"},"key":"train-modelwfjjp/train-modelwfjjp-update-dataset-240299978"}}}
ARGO_NODE_ID: train-modelwfjjp-240299978
ARGO_INCLUDE_SCRIPT_OUTPUT: false
ARGO_DEADLINE: 0001-01-01T00:00:00Z
ARGO_PROGRESS_FILE: /var/run/argo/progress
ARGO_PROGRESS_PATCH_TICK_DURATION: 1m0s
ARGO_PROGRESS_FILE_TICK_DURATION: 3s
Mounts:
/argo/secret/my-minio-cred from my-minio-cred (ro)
/var/run/argo from var-run-argo (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-2ccmp (ro)
Containers:
wait:
Container ID: docker://41b061c01540a2d43e67bb2b1fd4ca8ee202e8f85c02477793f5694dd55c68dd
Image: quay.io/argoproj/argoexec:latest
Image ID: docker-pullable://quay.io/argoproj/argoexec@sha256:0bc809b17019ac2c21a5e0cfc04715027bf8948a4d9e9632b48a9be42a1343b7
Port: <none>
Host Port: <none>
Command:
argoexec
wait
--loglevel
info
--log-format
text
State: Terminated
Reason: Completed
Exit Code: 0
Started: Thu, 04 Aug 2022 14:39:36 +0200
Finished: Thu, 04 Aug 2022 14:39:45 +0200
Ready: False
Restart Count: 0
Requests:
cpu: 10m
memory: 64Mi
Environment:
ARGO_POD_NAME: train-modelwfjjp-update-dataset-240299978 (v1:metadata.name)
ARGO_POD_UID: (v1:metadata.uid)
GODEBUG: x509ignoreCN=0
ARGO_WORKFLOW_NAME: train-modelwfjjp
ARGO_CONTAINER_NAME: wait
ARGO_TEMPLATE: {"name":"update-dataset","inputs":{},"outputs":{},"metadata":{},"container":{"name":"","image":"git.ni.dfki.de:5050/ml_infrastructure/argo-workflow:ml_pipeline_0.0.3_classifier","command":["sh","-c"],"args":["ls -la /mnt/datasets \u0026\u0026 env | grep AWS \u0026\u0026 python /code/update_dataset.py --gitrepo /mnt/datasets -b main -d hymenoptera_data"],"env":[{"name":"AWS_ACCESS_KEY_ID","value":"HK5KFDD6SPP2BAPDP6HX"},{"name":"AWS_SECRET_ACCESS_KEY","value":"iMaNleYB+oKEMILUHD6KLoQzClk8hlD+xldzOVGC"}],"resources":{},"volumeMounts":[{"name":"workdir","mountPath":"/mnt/datasets"},{"name":"saurabh-ssh-key","mountPath":"/root/.ssh"}]},"archiveLocation":{"archiveLogs":true,"s3":{"endpoint":"minio:9000","bucket":"my-bucket","insecure":true,"accessKeySecret":{"name":"my-minio-cred","key":"accesskey"},"secretKeySecret":{"name":"my-minio-cred","key":"secretkey"},"key":"train-modelwfjjp/train-modelwfjjp-update-dataset-240299978"}}}
ARGO_NODE_ID: train-modelwfjjp-240299978
ARGO_INCLUDE_SCRIPT_OUTPUT: false
ARGO_DEADLINE: 0001-01-01T00:00:00Z
ARGO_PROGRESS_FILE: /var/run/argo/progress
ARGO_PROGRESS_PATCH_TICK_DURATION: 1m0s
ARGO_PROGRESS_FILE_TICK_DURATION: 3s
Mounts:
/argo/secret/my-minio-cred from my-minio-cred (ro)
/mainctrfs/mnt/datasets from workdir (rw)
/mainctrfs/root/.ssh from saurabh-ssh-key (rw)
/tmp from tmp-dir-argo (rw,path="0")
/var/run/argo from var-run-argo (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-2ccmp (ro)
main:
Container ID: docker://bac842dec1d6230e8ea111cae44af836bb279d06c773ec0ae4bc3925bfd03bb1
Image: git.ni.dfki.de:5050/ml_infrastructure/argo-workflow:ml_pipeline_0.0.3_classifier
Image ID: docker-pullable://git.ni.dfki.de:5050/ml_infrastructure/argo-workflow@sha256:03de5f43c8a38bcfab73a3657f30b45b12a6d41353e30c64639b87fcbc7cfc52
Port: <none>
Host Port: <none>
Command:
/var/run/argo/argoexec
emissary
--
sh
-c
Args:
ls -la /mnt/datasets && env | grep AWS && python /code/update_dataset.py --gitrepo /mnt/datasets -b main -d hymenoptera_data
State: Terminated
Reason: Completed
Exit Code: 0
Started: Thu, 04 Aug 2022 14:39:37 +0200
Finished: Thu, 04 Aug 2022 14:39:44 +0200
Ready: False
Restart Count: 0
Environment:
AWS_ACCESS_KEY_ID: HK5KFDD6SPP2BAPDP6HX
AWS_SECRET_ACCESS_KEY: iMaNleYB+oKEMILUHD6KLoQzClk8hlD+xldzOVGC
ARGO_CONTAINER_NAME: main
ARGO_TEMPLATE: {"name":"update-dataset","inputs":{},"outputs":{},"metadata":{},"container":{"name":"","image":"git.ni.dfki.de:5050/ml_infrastructure/argo-workflow:ml_pipeline_0.0.3_classifier","command":["sh","-c"],"args":["ls -la /mnt/datasets \u0026\u0026 env | grep AWS \u0026\u0026 python /code/update_dataset.py --gitrepo /mnt/datasets -b main -d hymenoptera_data"],"env":[{"name":"AWS_ACCESS_KEY_ID","value":"HK5KFDD6SPP2BAPDP6HX"},{"name":"AWS_SECRET_ACCESS_KEY","value":"iMaNleYB+oKEMILUHD6KLoQzClk8hlD+xldzOVGC"}],"resources":{},"volumeMounts":[{"name":"workdir","mountPath":"/mnt/datasets"},{"name":"saurabh-ssh-key","mountPath":"/root/.ssh"}]},"archiveLocation":{"archiveLogs":true,"s3":{"endpoint":"minio:9000","bucket":"my-bucket","insecure":true,"accessKeySecret":{"name":"my-minio-cred","key":"accesskey"},"secretKeySecret":{"name":"my-minio-cred","key":"secretkey"},"key":"train-modelwfjjp/train-modelwfjjp-update-dataset-240299978"}}}
ARGO_NODE_ID: train-modelwfjjp-240299978
ARGO_INCLUDE_SCRIPT_OUTPUT: false
ARGO_DEADLINE: 0001-01-01T00:00:00Z
ARGO_PROGRESS_FILE: /var/run/argo/progress
ARGO_PROGRESS_PATCH_TICK_DURATION: 1m0s
ARGO_PROGRESS_FILE_TICK_DURATION: 3s
Mounts:
/mnt/datasets from workdir (rw)
/root/.ssh from saurabh-ssh-key (rw)
/var/run/argo from var-run-argo (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-2ccmp (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
var-run-argo:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
tmp-dir-argo:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
workdir:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: pvc-ml-infrastructure-datasets
ReadOnly: false
saurabh-ssh-key:
Type: Secret (a volume populated by a Secret)
SecretName: saurabh-ssh-key
Optional: false
my-minio-cred:
Type: Secret (a volume populated by a Secret)
SecretName: my-minio-cred
Optional: false
default-token-2ccmp:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-2ccmp
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 18m default-scheduler Successfully assigned argo/train-modelwfjjp-update-dataset-240299978 to agrigaia-ws3-u
Normal Pulling 18m kubelet Pulling image "quay.io/argoproj/argoexec:latest"
Normal Pulled 18m kubelet Successfully pulled image "quay.io/argoproj/argoexec:latest" in 1.552158278s
Normal Created 18m kubelet Created container init
Normal Started 18m kubelet Started container init
Normal Pulling 18m kubelet Pulling image "quay.io/argoproj/argoexec:latest"
Normal Pulled 18m kubelet Successfully pulled image "quay.io/argoproj/argoexec:latest" in 1.516819178s
Normal Created 18m kubelet Created container wait
Normal Started 18m kubelet Started container wait
Normal Pulled 18m kubelet Container image "git.ni.dfki.de:5050/ml_infrastructure/argo-workflow:ml_pipeline_0.0.3_classifier" already present on machine
Normal Created 18m kubelet Created container main
Normal Started 18m kubelet Started container main
2nd experiment
However, if I comment out one of the tasks in pipeline.yaml, each task runs successfully on its own and produces the desired output, but the GUI again marks the task as failed even though the process completed.
### updated pipeline.yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: train-model
spec:
  entrypoint: ml-pipeline
  imagePullSecrets:
  - name: ml-infrastructure-git-registry  # secret setup in kubernetes cluster
  volumes:
  - name: workdir  # name of the volume to be mounted in Workflow template
    persistentVolumeClaim:
      claimName: pvc-ml-infrastructure-datasets
  - name: saurabh-ssh-key
    secret:
      secretName: saurabh-ssh-key
      defaultMode: 0400
  templates:
  - name: ml-pipeline
    dag:
      failFast: false
      tasks:
      - name: update-dataset
        templateRef:
          name: update-dataset
          template: update-dataset
        arguments:
          parameters:
          - name: message
            value: update-dataset
      #- name: train-classifier
      #  depends: update-dataset
      #  templateRef:
      #    name: train-classifier
      #    template: train-classifier
      #  arguments:
      #    parameters:
      #    - name: message
      #      value: train-classifier
#### Outputs:
STEP TEMPLATE PODNAME DURATION MESSAGE
⚠ train-modelpcndd ml-pipeline
└─⚠ update-dataset update-dataset/update-dataset train-modelpcndd-update-dataset/update-dataset-1823868785 10s task 'train-modelpcndd.update-dataset' errored: expected < 2 pods, got 2 - this is a bug
### 2nd updated pipeline.yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: train-model
spec:
  entrypoint: ml-pipeline
  imagePullSecrets:
  - name: ml-infrastructure-git-registry  # secret setup in kubernetes cluster
  volumes:
  - name: workdir  # name of the volume to be mounted in Workflow template
    persistentVolumeClaim:
      claimName: pvc-ml-infrastructure-datasets
  - name: saurabh-ssh-key
    secret:
      secretName: saurabh-ssh-key
      defaultMode: 0400
  templates:
  - name: ml-pipeline
    dag:
      failFast: false
      tasks:
      #- name: update-dataset
      #  templateRef:
      #    name: update-dataset
      #    template: update-dataset
      #  arguments:
      #    parameters:
      #    - name: message
      #      value: update-dataset
      - name: train-classifier
        # depends: update-dataset
        templateRef:
          name: train-classifier
          template: train-classifier
        arguments:
          parameters:
          - name: message
            value: train-classifier
#### Output
STEP TEMPLATE PODNAME DURATION MESSAGE
⚠ train-modelgfw7t ml-pipeline
└─⚠ train-classifier train-classifier/train-classifier train-modelgfw7t-train-classifier/train-classifier-302168041 10s task 'train-modelgfw7t.train-classifier' errored: expected < 2 pods, got 2 - this is a bug
The pod is still running and model training is going on inside the pod.
kubectl get pods -n argo
NAME READY STATUS RESTARTS AGE
argo-server-5854fd8bf9-tjsxx 1/1 Running 2 20d
httpbin-744fd84d99-lqhbr 1/1 Running 0 20d
minio-79b96ccfb8-l7wqn 1/1 Running 0 20d
postgres-6b5c55f477-lkb8t 1/1 Running 0 20d
train-modelgfw7t-2195991128 2/2 Running 0 2m1s
train-modelgfw7t-train-classifier-2195991128 2/2 Running 0 2m1s
train-modelpcndd-3393073072 0/2 Completed 0 9m52s
train-modelpcndd-update-dataset-3393073072 0/2 Completed 0 9m52s
train-modelwfjjp-240299978 0/2 Completed 0 31m
train-modelwfjjp-update-dataset-240299978 0/2 Completed 0 31m
workflow-controller-66fd66b857-gbt9z 1/1 Running 4 20d
The GUI shows the task was cancelled due to an error.
#### Intermediate logs
But the pod is running and the logs are still being updated!
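To confirm the main container really is still training despite what the UI reports, the live logs can be followed, e.g. using the workflow and pod names from the listing above:
# Follow the workflow logs via the argo CLI
argo logs train-modelgfw7t -n argo --follow
# Or follow the main container of the pod directly
kubectl logs -f train-modelgfw7t-train-classifier-2195991128 -c main -n argo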
Hi @sarabala1979, did you find out what went wrong?
I am also seeing this issue. I noticed it may happen when we submit a workflow and, after it has completed, quickly delete it and resubmit a workflow with the same name.
I submitted the below example workflow:
# Example of loops using DAGs
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: hello3
spec:
  entrypoint: loops-dag
  templates:
  - name: loops-dag
    dag:
      tasks:
      - name: A
        template: whalesay
        arguments:
          parameters:
          - {name: message, value: A}
      - name: B
        depends: "A"
        template: whalesay
        arguments:
          parameters:
          - {name: message, value: B}
      - name: C
        depends: "B"
        template: whalesay
        arguments:
          parameters:
          - {name: message, value: C}
  - name: whalesay
    inputs:
      parameters:
      - name: message
    container:
      image: docker/whalesay:latest
      command: [cowsay]
      args: ["{{inputs.parameters.message}}"]
After the workflow is completed, delete the workflow "hello3" and quickly resubmit the workflow with the above specification.
Workaround: do not resubmit workflows with the same name too quickly.
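In other words, the race can be provoked with a sequence like the one below (hello3.yaml here is a hypothetical file containing the manifest above). Using metadata.generateName instead of a fixed metadata.name, as the other examples in this thread do, avoids reusing the name at all:
# Delete the completed workflow and immediately resubmit one with the same metadata.name
argo delete hello3 -n argo
argo submit hello3.yaml -n argo --watch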
I am also experiencing this exact same issue. Strangest of all, I have the exact same set of workflows deployed to an identical cluster which does not experience the issue. This leads me to believe it has to do with an artefact left over from a previous deployment.
Both environments are running this workflow-controller image:
Image: quay.io/argoproj/workflow-controller:v3.3.6
Image ID: quay.io/argoproj/workflow-controller@sha256:4aa99011a916680b8866c27cd5b56667eb4d91a853acb2c4ae7d153bc9288043
I solved this.
It turns out I had two Argo charts deployed, so two workflow controllers, and therefore two pods per task.
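For anyone else debugging this, a quick sanity check that only one controller is reconciling workflows (a sketch; it assumes the deployments contain "workflow-controller" in their name, as the default charts do):
# More than one line here usually means duplicate controllers fighting over the same workflows
kubectl get deployments --all-namespaces | grep workflow-controller
kubectl get pods --all-namespaces | grep workflow-controller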
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If this is a mentoring request, please provide an update here. Thank you for your contributions.
This issue has been closed due to inactivity. Feel free to re-open if you still encounter this issue.
The issue can be easily reproduced by making one step run longer than the scheduling frequency, with something like the CronWorkflow below. The step step-fetch-hello-hey takes 800 seconds to complete, but the workflow is scheduled every 120 seconds, and we see the issue (at least with concurrencyPolicy: "Replace").
apiVersion: argoproj.io/v1alpha1
kind: CronWorkflow
metadata:
  name: testdag
spec:
  schedule: "*/2 * * * *"
  concurrencyPolicy: "Replace"
  startingDeadlineSeconds: 0
  workflowSpec:
    entrypoint: trproc-dag
    templates:
    - name: trproc-dag
      dag:
        tasks:
        - name: step-fetch-hello-hi
          template: fetch-hello-hi
        - name: step-fetch-hello-hey
          depends: "step-fetch-hello-hi"
          template: fetch-hello-hey
        - name: step-process-hello-hey
          depends: "step-fetch-hello-hey"
          template: process-hello-hey
        - name: step-prep-another-hey
          depends: "step-process-hello-hey"
          template: prep-another-hey
    - name: fetch-hello-hi
      container:
        image: nipuntest:1.0
        imagePullPolicy: IfNotPresent
        command: ["python", "sleepseconds.py", "2"]
    - name: fetch-hello-hey
      container:
        image: nipuntest:1.0
        imagePullPolicy: IfNotPresent
        command: ["python", "sleepseconds.py", "800"]
    - name: process-hello-hey
      container:
        image: nipuntest:1.0
        imagePullPolicy: IfNotPresent
        command: ["python", "sleepseconds.py", "2"]
    - name: prep-another-hey
      container:
        image: nipuntest:1.0
        imagePullPolicy: IfNotPresent
        command: ["python", "sleepseconds.py", "2"]
Hello, I have the same problem. It worked correctly until yesterday, and I do not know what has changed. What I see in the log of a node's wait pod is: time="2023-02-08T12:43:04.523Z" level=info msg="Creating a docker executor". Is that normal? I am using Argo Workflows 3.4.5, but I see that argoexec is 2.12.5, which is surely the problem. I will check what is wrong. Update: sorry, it was my fault; another controller pod was running version 2.12.5. An operational mistake on my side somewhere...
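A sketch of how such a version mismatch can be spotted, assuming the default argo namespace and a Deployment named workflow-controller (<workflow-pod> is a placeholder for any pod created by a recent workflow):
# Controller image actually deployed
kubectl -n argo get deploy workflow-controller -o jsonpath='{.spec.template.spec.containers[0].image}'
# Executor image injected into workflow pods (the init container runs argoexec)
kubectl -n argo get pod <workflow-pod> -o jsonpath='{.spec.initContainers[0].image}'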