
A bug encountered when executing examples from workflow-templates - 'A' errored: expected < 2 pods, got 2 - this is a bug

Open MISSEY opened this issue 1 year ago • 6 comments

Checklist

  • [x] Double-checked my configuration.
  • [x] Tested using the latest version.
  • [x] Used the Emissary executor.

Summary

I created a template using

argo template create -n argo templates.yaml 

from examples/workflow-templates.

Then I submitted dag.yaml using

argo submit -n argo --watch dag.yaml

What version are you running? argo: v3.3.8

Diagnostics

Paste the smallest workflow that reproduces the bug. We must be able to run the workflow.

templates.yaml

apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: workflow-template-whalesay-template
spec:
  entrypoint: whalesay-template
  templates:
  - name: whalesay-template
    inputs:
      parameters:
      - name: message
    container:
      image: docker/whalesay
      command: [cowsay]
      args: ["{{inputs.parameters.message}}"]
---
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: workflow-template-random-fail-template
spec:
  templates:
  - name: random-fail-template
    retryStrategy:
      limit: 10
    container:
      image: python:alpine3.6
      command: [python, -c]
      # fail with a 66% probability
      args: ["import random; import sys; exit_code = random.choice([0, 1, 1]); sys.exit(exit_code)"]
---
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: workflow-template-inner-steps
spec:
  templates:
  - name: whalesay-template
    inputs:
      parameters:
      - name: message
    container:
      image: docker/whalesay
      command: [cowsay]
      args: ["{{inputs.parameters.message}}"]
  - name: inner-steps
    steps:
    - - name: inner-hello1
        templateRef:
          name: workflow-template-whalesay-template
          template: whalesay-template
        arguments:
          parameters:
          - name: message
            value: "inner-hello1"
    - - name: inner-hello2a
        templateRef:
          name: workflow-template-whalesay-template
          template: whalesay-template
        arguments:
          parameters:
          - name: message
            value: "inner-hello2a"
      - name: inner-hello2b
        template: whalesay-template
        arguments:
          parameters:
          - name: message
            value: "inner-hello2b"
---
# The following workflow executes a diamond workflow
#
#   A
#  / \
# B   C
#  \ /
#   D
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: workflow-template-inner-dag
spec:
  templates:
  - name: whalesay-template
    inputs:
      parameters:
      - name: message
    container:
      image: docker/whalesay
      command: [cowsay]
      args: ["{{inputs.parameters.message}}"]
  - name: inner-diamond
    dag:
      tasks:
      - name: inner-A
        templateRef:
          name: workflow-template-whalesay-template
          template: whalesay-template
        arguments:
          parameters:
          - name: message
            value: inner-A
      - name: inner-B
        depends: "inner-A"
        template: whalesay-template
        arguments:
          parameters:
          - name: message
            value: inner-B
      - name: inner-C
        depends: "inner-A"
        template: whalesay-template
        arguments:
          parameters:
          - name: message
            value: inner-C
      - name: inner-D
        depends: "inner-B && inner-C"
        templateRef:
          name: workflow-template-whalesay-template
          template: whalesay-template
        arguments:
          parameters:
          - name: message
            value: inner-D
---
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: workflow-template-submittable
spec:
  entrypoint: whalesay-template
  arguments:
    parameters:
      - name: message
        value: hello world
  templates:
    - name: whalesay-template
      inputs:
        parameters:
          - name: message
      container:
        image: docker/whalesay
        command: [cowsay]
        args: ["{{inputs.parameters.message}}"]

dag.yaml

# The following workflow executes a diamond workflow
#
#   A
#  / \
# B   C
#  \ /
#   D
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: workflow-template-dag-diamond-
spec:
  entrypoint: diamond
  templates:
  - name: diamond
    dag:
      tasks:
      - name: A
        templateRef:
          name: workflow-template-whalesay-template
          template: whalesay-template
        arguments:
          parameters:
          - name: message
            value: A
      - name: B
        depends: "A"
        templateRef:
          name: workflow-template-whalesay-template
          template: whalesay-template
        arguments:
          parameters:
          - name: message
            value: B
      - name: C
        depends: "A"
        templateRef:
          name: workflow-template-inner-dag
          template: inner-diamond
      - name: D
        depends: "B && C"
        templateRef:
          name: workflow-template-whalesay-template
          template: whalesay-template
        arguments:
          parameters:
          - name: message
            value: D

(screenshot)


Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.

MISSEY avatar Jul 27 '22 08:07 MISSEY

@MISSEY it looks like there are two pods for task A. Can you check the workflow pods in your namespace? Please provide the YAML of the workflow that ran (kubectl get wf <workflowname>) and the workflow pod YAMLs.

Can you attach the workflow controller logs?
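
For anyone gathering the same diagnostics, a rough sketch of the commands (a non-authoritative example; the label selector and the workflow-controller Deployment name are taken from the pod output further down this thread and may differ per install):

# dump the workflow object as it ran
kubectl -n argo get wf <workflowname> -o yaml > workflow.yaml
# dump every pod the controller created for this workflow
kubectl -n argo get pods -l workflows.argoproj.io/workflow=<workflowname> -o yaml > pods.yaml
# grab the controller logs
kubectl -n argo logs deploy/workflow-controller > controller.log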

sarabala1979 avatar Jul 27 '22 15:07 sarabala1979

@sarabala1979

I am sorry for the late response; I was not well in the past week.

I am now running into very different problems related to the same issue.

I prepared a workflow called ml-pipeline with a DAG that has two workflow templates, update-dataset and train-classifier.

I have tried to explain the problems with two experiments.

1st experiment

pipeline.yaml

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: train-model
spec:
  entrypoint: ml-pipeline
  imagePullSecrets:
    - name: ml-infrastructure-git-registry # secret setup in kubernetes cluster
  volumes:
    - name: workdir # name of the volume to be mounted in Workflow template
      persistentVolumeClaim:
        claimName: pvc-ml-infrastructure-datasets
    - name: saurabh-ssh-key
      secret:
        secretName: saurabh-ssh-key
        defaultMode: 0400
  templates:
  - name: ml-pipeline
    dag:
      failFast: false
      tasks:
        - name: update-dataset
          templateRef:
            name: update-dataset
            template: update-dataset
          arguments:
            parameters:
              - name: message
                value: update-dataset

        - name: train-classifier
          depends: update-dataset
          templateRef:
            name: train-classifier
            template: train-classifier
          arguments:
            parameters:
              - name: message
                value: train-classifier

update-dataset.yaml

apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: update-dataset
spec:
  templates:
    - name: update-dataset
      container:
        image: git.ni.dfki.de:5050/ml_infrastructure/argo-workflow:ml_pipeline_0.0.3_classifier
        command: [ sh, -c ]
        args: [ "ls -la /mnt/datasets && env | grep AWS && python /code/update_dataset.py --gitrepo /mnt/datasets -b main -d hymenoptera_data"]
        env:
          - name: AWS_ACCESS_KEY_ID # for dvc
            value:    #  Minio credential for DVC ,
          - name: AWS_SECRET_ACCESS_KEY
            value: # service key for minio account
        volumeMounts:
          - name: workdir
            mountPath: /mnt/datasets
          - name: saurabh-ssh-key
            mountPath: "/root/.ssh"

train-classifier.yaml


apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: train-classifier
spec:
  templates:
    - name: train-classifier
      inputs:
        parameters:
          - name: message
      container:
        image: git.ni.dfki.de:5050/ml_infrastructure/argo-workflow:ml_pipeline_0.0.3_classifier
        command: [ sh, -c ]
        args: [ "echo {{inputs.parameters.message}} && python /code/cnn_classifier.py --datasetpath /mnt/datasets --model cnn_classifier --config cnn_classifier.yaml  --datasetname hymenoptera_data"]
        volumeMounts:
          - name: workdir
            mountPath: /mnt/datasets
          - name: saurabh-ssh-key
            mountPath: "/root/.ssh"

If I run both tasks in the DAG, there is an error on the command line:

STEP                 TEMPLATE                       PODNAME                                                    DURATION  MESSAGE
 ⚠ train-modelwfjjp  ml-pipeline                                                                                                                                                                                   
 ├─⚠ update-dataset  update-dataset/update-dataset  train-modelwfjjp-update-dataset/update-dataset-1823868785  10s       task 'train-modelwfjjp.update-dataset' errored: expected < 2 pods, got 2 - this is a bug 

and on the GUI:

(screenshots)

Though there was no error in the update-dataset task, on the GUI it shows as failed.

$ kubectl describe pod train-modelwfjjp-update-dataset-24029997 -n argo
I0804 14:57:36.904530  195139 request.go:655] Throttling request took 1.17693408s, request: GET:https://lb.kubesphere.local:6443/apis/notification.kubesphere.io/v2beta1?timeout=32s
Name:         train-modelwfjjp-update-dataset-240299978
Namespace:    argo
Priority:     0
Node:         agrigaia-ws3-u/10.249.3.13
Start Time:   Thu, 04 Aug 2022 14:39:31 +0200
Labels:       workflows.argoproj.io/completed=true
              workflows.argoproj.io/workflow=train-modelwfjjp
Annotations:  cni.projectcalico.org/containerID: d3a26f4470acacc62017e8e6a80e0ad199fc211eeb76452830dc20e74ecadc0c
              cni.projectcalico.org/podIP: 
              cni.projectcalico.org/podIPs: 
              kubectl.kubernetes.io/default-container: main
              workflows.argoproj.io/node-id: train-modelwfjjp-240299978
              workflows.argoproj.io/node-name: train-modelwfjjp.update-dataset
Status:       Succeeded
IP:           10.233.96.88
IPs:
  IP:           10.233.96.88
Controlled By:  Workflow/train-modelwfjjp
Init Containers:
  init:
    Container ID:  docker://bd3955ffabeebec24e35ea85e7cfd20f20a2fd4881a9bf68b648c22a5df11b48
    Image:         quay.io/argoproj/argoexec:latest
    Image ID:      docker-pullable://quay.io/argoproj/argoexec@sha256:0bc809b17019ac2c21a5e0cfc04715027bf8948a4d9e9632b48a9be42a1343b7
    Port:          <none>
    Host Port:     <none>
    Command:
      argoexec
      init
      --loglevel
      info
      --log-format
      text
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Thu, 04 Aug 2022 14:39:34 +0200
      Finished:     Thu, 04 Aug 2022 14:39:34 +0200
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:     10m
      memory:  64Mi
    Environment:
      ARGO_POD_NAME:                      train-modelwfjjp-update-dataset-240299978 (v1:metadata.name)
      ARGO_POD_UID:                        (v1:metadata.uid)
      GODEBUG:                            x509ignoreCN=0
      ARGO_WORKFLOW_NAME:                 train-modelwfjjp
      ARGO_CONTAINER_NAME:                init
      ARGO_TEMPLATE:                      {"name":"update-dataset","inputs":{},"outputs":{},"metadata":{},"container":{"name":"","image":"git.ni.dfki.de:5050/ml_infrastructure/argo-workflow:ml_pipeline_0.0.3_classifier","command":["sh","-c"],"args":["ls -la /mnt/datasets \u0026\u0026 env | grep AWS \u0026\u0026 python /code/update_dataset.py --gitrepo /mnt/datasets -b main -d hymenoptera_data"],"env":[{"name":"AWS_ACCESS_KEY_ID","value":"HK5KFDD6SPP2BAPDP6HX"},{"name":"AWS_SECRET_ACCESS_KEY","value":"iMaNleYB+oKEMILUHD6KLoQzClk8hlD+xldzOVGC"}],"resources":{},"volumeMounts":[{"name":"workdir","mountPath":"/mnt/datasets"},{"name":"saurabh-ssh-key","mountPath":"/root/.ssh"}]},"archiveLocation":{"archiveLogs":true,"s3":{"endpoint":"minio:9000","bucket":"my-bucket","insecure":true,"accessKeySecret":{"name":"my-minio-cred","key":"accesskey"},"secretKeySecret":{"name":"my-minio-cred","key":"secretkey"},"key":"train-modelwfjjp/train-modelwfjjp-update-dataset-240299978"}}}
      ARGO_NODE_ID:                       train-modelwfjjp-240299978
      ARGO_INCLUDE_SCRIPT_OUTPUT:         false
      ARGO_DEADLINE:                      0001-01-01T00:00:00Z
      ARGO_PROGRESS_FILE:                 /var/run/argo/progress
      ARGO_PROGRESS_PATCH_TICK_DURATION:  1m0s
      ARGO_PROGRESS_FILE_TICK_DURATION:   3s
    Mounts:
      /argo/secret/my-minio-cred from my-minio-cred (ro)
      /var/run/argo from var-run-argo (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-2ccmp (ro)
Containers:
  wait:
    Container ID:  docker://41b061c01540a2d43e67bb2b1fd4ca8ee202e8f85c02477793f5694dd55c68dd
    Image:         quay.io/argoproj/argoexec:latest
    Image ID:      docker-pullable://quay.io/argoproj/argoexec@sha256:0bc809b17019ac2c21a5e0cfc04715027bf8948a4d9e9632b48a9be42a1343b7
    Port:          <none>
    Host Port:     <none>
    Command:
      argoexec
      wait
      --loglevel
      info
      --log-format
      text
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Thu, 04 Aug 2022 14:39:36 +0200
      Finished:     Thu, 04 Aug 2022 14:39:45 +0200
    Ready:          False
    Restart Count:  0
    Requests:
      cpu:     10m
      memory:  64Mi
    Environment:
      ARGO_POD_NAME:                      train-modelwfjjp-update-dataset-240299978 (v1:metadata.name)
      ARGO_POD_UID:                        (v1:metadata.uid)
      GODEBUG:                            x509ignoreCN=0
      ARGO_WORKFLOW_NAME:                 train-modelwfjjp
      ARGO_CONTAINER_NAME:                wait
      ARGO_TEMPLATE:                      {"name":"update-dataset","inputs":{},"outputs":{},"metadata":{},"container":{"name":"","image":"git.ni.dfki.de:5050/ml_infrastructure/argo-workflow:ml_pipeline_0.0.3_classifier","command":["sh","-c"],"args":["ls -la /mnt/datasets \u0026\u0026 env | grep AWS \u0026\u0026 python /code/update_dataset.py --gitrepo /mnt/datasets -b main -d hymenoptera_data"],"env":[{"name":"AWS_ACCESS_KEY_ID","value":"HK5KFDD6SPP2BAPDP6HX"},{"name":"AWS_SECRET_ACCESS_KEY","value":"iMaNleYB+oKEMILUHD6KLoQzClk8hlD+xldzOVGC"}],"resources":{},"volumeMounts":[{"name":"workdir","mountPath":"/mnt/datasets"},{"name":"saurabh-ssh-key","mountPath":"/root/.ssh"}]},"archiveLocation":{"archiveLogs":true,"s3":{"endpoint":"minio:9000","bucket":"my-bucket","insecure":true,"accessKeySecret":{"name":"my-minio-cred","key":"accesskey"},"secretKeySecret":{"name":"my-minio-cred","key":"secretkey"},"key":"train-modelwfjjp/train-modelwfjjp-update-dataset-240299978"}}}
      ARGO_NODE_ID:                       train-modelwfjjp-240299978
      ARGO_INCLUDE_SCRIPT_OUTPUT:         false
      ARGO_DEADLINE:                      0001-01-01T00:00:00Z
      ARGO_PROGRESS_FILE:                 /var/run/argo/progress
      ARGO_PROGRESS_PATCH_TICK_DURATION:  1m0s
      ARGO_PROGRESS_FILE_TICK_DURATION:   3s
    Mounts:
      /argo/secret/my-minio-cred from my-minio-cred (ro)
      /mainctrfs/mnt/datasets from workdir (rw)
      /mainctrfs/root/.ssh from saurabh-ssh-key (rw)
      /tmp from tmp-dir-argo (rw,path="0")
      /var/run/argo from var-run-argo (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-2ccmp (ro)
  main:
    Container ID:  docker://bac842dec1d6230e8ea111cae44af836bb279d06c773ec0ae4bc3925bfd03bb1
    Image:         git.ni.dfki.de:5050/ml_infrastructure/argo-workflow:ml_pipeline_0.0.3_classifier
    Image ID:      docker-pullable://git.ni.dfki.de:5050/ml_infrastructure/argo-workflow@sha256:03de5f43c8a38bcfab73a3657f30b45b12a6d41353e30c64639b87fcbc7cfc52
    Port:          <none>
    Host Port:     <none>
    Command:
      /var/run/argo/argoexec
      emissary
      --
      sh
      -c
    Args:
      ls -la /mnt/datasets && env | grep AWS && python /code/update_dataset.py --gitrepo /mnt/datasets -b main -d hymenoptera_data
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Thu, 04 Aug 2022 14:39:37 +0200
      Finished:     Thu, 04 Aug 2022 14:39:44 +0200
    Ready:          False
    Restart Count:  0
    Environment:
      AWS_ACCESS_KEY_ID:                  HK5KFDD6SPP2BAPDP6HX
      AWS_SECRET_ACCESS_KEY:              iMaNleYB+oKEMILUHD6KLoQzClk8hlD+xldzOVGC
      ARGO_CONTAINER_NAME:                main
      ARGO_TEMPLATE:                      {"name":"update-dataset","inputs":{},"outputs":{},"metadata":{},"container":{"name":"","image":"git.ni.dfki.de:5050/ml_infrastructure/argo-workflow:ml_pipeline_0.0.3_classifier","command":["sh","-c"],"args":["ls -la /mnt/datasets \u0026\u0026 env | grep AWS \u0026\u0026 python /code/update_dataset.py --gitrepo /mnt/datasets -b main -d hymenoptera_data"],"env":[{"name":"AWS_ACCESS_KEY_ID","value":"HK5KFDD6SPP2BAPDP6HX"},{"name":"AWS_SECRET_ACCESS_KEY","value":"iMaNleYB+oKEMILUHD6KLoQzClk8hlD+xldzOVGC"}],"resources":{},"volumeMounts":[{"name":"workdir","mountPath":"/mnt/datasets"},{"name":"saurabh-ssh-key","mountPath":"/root/.ssh"}]},"archiveLocation":{"archiveLogs":true,"s3":{"endpoint":"minio:9000","bucket":"my-bucket","insecure":true,"accessKeySecret":{"name":"my-minio-cred","key":"accesskey"},"secretKeySecret":{"name":"my-minio-cred","key":"secretkey"},"key":"train-modelwfjjp/train-modelwfjjp-update-dataset-240299978"}}}
      ARGO_NODE_ID:                       train-modelwfjjp-240299978
      ARGO_INCLUDE_SCRIPT_OUTPUT:         false
      ARGO_DEADLINE:                      0001-01-01T00:00:00Z
      ARGO_PROGRESS_FILE:                 /var/run/argo/progress
      ARGO_PROGRESS_PATCH_TICK_DURATION:  1m0s
      ARGO_PROGRESS_FILE_TICK_DURATION:   3s
    Mounts:
      /mnt/datasets from workdir (rw)
      /root/.ssh from saurabh-ssh-key (rw)
      /var/run/argo from var-run-argo (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-2ccmp (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  var-run-argo:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  tmp-dir-argo:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  workdir:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  pvc-ml-infrastructure-datasets
    ReadOnly:   false
  saurabh-ssh-key:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  saurabh-ssh-key
    Optional:    false
  my-minio-cred:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  my-minio-cred
    Optional:    false
  default-token-2ccmp:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-2ccmp
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  18m   default-scheduler  Successfully assigned argo/train-modelwfjjp-update-dataset-240299978 to agrigaia-ws3-u
  Normal  Pulling    18m   kubelet            Pulling image "quay.io/argoproj/argoexec:latest"
  Normal  Pulled     18m   kubelet            Successfully pulled image "quay.io/argoproj/argoexec:latest" in 1.552158278s
  Normal  Created    18m   kubelet            Created container init
  Normal  Started    18m   kubelet            Started container init
  Normal  Pulling    18m   kubelet            Pulling image "quay.io/argoproj/argoexec:latest"
  Normal  Pulled     18m   kubelet            Successfully pulled image "quay.io/argoproj/argoexec:latest" in 1.516819178s
  Normal  Created    18m   kubelet            Created container wait
  Normal  Started    18m   kubelet            Started container wait
  Normal  Pulled     18m   kubelet            Container image "git.ni.dfki.de:5050/ml_infrastructure/argo-workflow:ml_pipeline_0.0.3_classifier" already present on machine
  Normal  Created    18m   kubelet            Created container main
  Normal  Started    18m   kubelet            Started container main
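
Since the controller complains about finding two pods for one node, it can help to list every pod it associates with the workflow by filtering on the label visible in the describe output above (a sketch; substitute your own workflow name):

kubectl -n argo get pods -l workflows.argoproj.io/workflow=train-modelwfjjp --show-labels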



2nd experiment

However, if I comment out one of the tasks in pipeline.yaml, each task runs independently to success and produces the desired output, but on the GUI the task is again marked as failed even though the process completed.

### updated pipeline.yaml

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: train-model
spec:
  entrypoint: ml-pipeline
  imagePullSecrets:
    - name: ml-infrastructure-git-registry # secret setup in kubernetes cluster
  volumes:
    - name: workdir # name of the volume to be mounted in Workflow template
      persistentVolumeClaim:
        claimName: pvc-ml-infrastructure-datasets
    - name: saurabh-ssh-key
      secret:
        secretName: saurabh-ssh-key
        defaultMode: 0400
  templates:
  - name: ml-pipeline
    dag:
      failFast: false
      tasks:
        - name: update-dataset
          templateRef:
            name: update-dataset
            template: update-dataset
          arguments:
            parameters:
              - name: message
                value: update-dataset

        #- name: train-classifier
        #  depends: update-dataset
        #  templateRef:
        #    name: train-classifier
        #    template: train-classifier
        #  arguments:
        #    parameters:
        #      - name: message
        #        value: train-classifier

#### Output:

STEP                 TEMPLATE                       PODNAME                                                    DURATION  MESSAGE
 ⚠ train-modelpcndd  ml-pipeline                                                                                                                                                                                   
 └─⚠ update-dataset  update-dataset/update-dataset  train-modelpcndd-update-dataset/update-dataset-1823868785  10s       task 'train-modelpcndd.update-dataset' errored: expected < 2 pods, got 2 - this is a bug 

(screenshots)

### 2nd updated pipeline.yaml


apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: train-model
spec:
  entrypoint: ml-pipeline
  imagePullSecrets:
    - name: ml-infrastructure-git-registry # secret setup in kubernetes cluster
  volumes:
    - name: workdir # name of the volume to be mounted in Workflow template
      persistentVolumeClaim:
        claimName: pvc-ml-infrastructure-datasets
    - name: saurabh-ssh-key
      secret:
        secretName: saurabh-ssh-key
        defaultMode: 0400
  templates:
  - name: ml-pipeline
    dag:
      failFast: false
      tasks:
        #- name: update-dataset
        #  templateRef:
        #    name: update-dataset
        #    template: update-dataset
        #  arguments:
        #    parameters:
        #      - name: message
        #        value: update-dataset

        - name: train-classifier
       #   depends: update-dataset
          templateRef:
            name: train-classifier
            template: train-classifier
          arguments:
            parameters:
              - name: message
                value: train-classifier

#### Output:

STEP                   TEMPLATE                           PODNAME                                                       DURATION  MESSAGE
 ⚠ train-modelgfw7t    ml-pipeline                                                                                                                                                                                            
 └─⚠ train-classifier  train-classifier/train-classifier  train-modelgfw7t-train-classifier/train-classifier-302168041  10s       task 'train-modelgfw7t.train-classifier' errored: expected < 2 pods, got 2 - this is a bug 

The pod is still running and model training is still proceeding inside the pod.

kubectl get pods -n argo
NAME                                           READY   STATUS      RESTARTS   AGE
argo-server-5854fd8bf9-tjsxx                   1/1     Running     2          20d
httpbin-744fd84d99-lqhbr                       1/1     Running     0          20d
minio-79b96ccfb8-l7wqn                         1/1     Running     0          20d
postgres-6b5c55f477-lkb8t                      1/1     Running     0          20d
train-modelgfw7t-2195991128                    2/2     Running     0          2m1s
train-modelgfw7t-train-classifier-2195991128   2/2     Running     0          2m1s
train-modelpcndd-3393073072                    0/2     Completed   0          9m52s
train-modelpcndd-update-dataset-3393073072     0/2     Completed   0          9m52s
train-modelwfjjp-240299978                     0/2     Completed   0          31m
train-modelwfjjp-update-dataset-240299978      0/2     Completed   0          31m
workflow-controller-66fd66b857-gbt9z           1/1     Running     4          20d

The GUI shows the task as cancelled due to the error (screenshot).

#### Intermediate logs

(screenshot)

But the pod is running and the logs are still being updated!

(screenshots)

MISSEY avatar Aug 04 '22 13:08 MISSEY

Hi @sarabala1979, did you find out what went wrong?

MISSEY avatar Aug 08 '22 08:08 MISSEY

I am also seeing this issue. I noticed it may happen when we submit a workflow and then, after the workflow has completed, quickly delete it and resubmit a workflow with the same name.

I submitted the below example workflow:

# Example of loops using DAGs
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: hello3
spec:
  entrypoint: loops-dag
  templates:
  - name: loops-dag
    dag:
      tasks:
      - name: A
        template: whalesay
        arguments:
          parameters:
          - {name: message, value: A}
      - name: B
        depends: "A"
        template: whalesay
        arguments:
          parameters:
          - {name: message, value: B}
      - name: C
        depends: "B"
        template: whalesay
        arguments:
          parameters:
          - {name: message, value: C}

  - name: whalesay
    inputs:
      parameters:
      - name: message
    container:
      image: docker/whalesay:latest
      command: [cowsay]
      args: ["{{inputs.parameters.message}}"]

After the workflow is completed, delete the workflow "hello3" and quickly resubmit the workflow with the above specification.
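
A rough shell sketch of that reproduction sequence, assuming the manifest above is saved as hello3.yaml:

argo submit -n argo hello3.yaml --wait    # let the workflow run to completion
argo delete -n argo hello3                # delete the finished workflow
argo submit -n argo hello3.yaml           # immediately resubmit under the same name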

nipuntalukdar avatar Aug 31 '22 14:08 nipuntalukdar

Work-around: do not quickly resubmit workflows with the same name.
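
One way to side-step the name collision entirely (as the other examples in this thread already do) is to let the server generate a unique name with generateName instead of a fixed name, for example:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: hello3-   # a unique suffix is appended on every submission
spec:
  # ... same spec (entrypoint, templates) as the loops-dag example above ...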

alexec avatar Sep 05 '22 20:09 alexec

I am also experiencing this exact same issue. Strangest of all, I have the exact same set of workflows deployed to an identical cluster that does not experience the issue. This leads me to believe it has to do with an artefact left over from a previous deployment.

Both environments are running workflow-controller

    Image:         quay.io/argoproj/workflow-controller:v3.3.6                                                                                                                                                              
    Image ID:      quay.io/argoproj/workflow-controller@sha256:4aa99011a916680b8866c27cd5b56667eb4d91a853acb2c4ae7d153bc9288043      

I solved this.

It turns out I had two Argo charts deployed, so two workflow controllers, and therefore two pods per task.
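
A quick way to check for duplicate controllers (exact resource names depend on how the chart was installed):

kubectl get deployments --all-namespaces | grep -i workflow-controller
kubectl get pods --all-namespaces | grep -i workflow-controller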

8ball030 avatar Sep 08 '22 15:09 8ball030

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If this is a mentoring request, please provide an update here. Thank you for your contributions.

stale[bot] avatar Oct 01 '22 06:10 stale[bot]

This issue has been closed due to inactivity. Feel free to re-open if you still encounter this issue.

stale[bot] avatar Oct 16 '22 00:10 stale[bot]

The issue can easily be reproduced by making one step run for longer than the scheduling interval, with something like the CronWorkflow below. The step step-fetch-hello-hey takes 800 seconds to complete, but the workflow is scheduled every 120 seconds, and we see the issue (at least with concurrencyPolicy: "Replace").

apiVersion: argoproj.io/v1alpha1
kind: CronWorkflow
metadata:
  name: testdag
spec:
  schedule: "*/2 * * * *"
  concurrencyPolicy: "Replace"
  startingDeadlineSeconds: 0
  workflowSpec:
    entrypoint: trproc-dag
    templates:
    - name: trproc-dag
      dag:
        tasks:
        - name: step-fetch-hello-hi
          template: fetch-hello-hi
        - name: step-fetch-hello-hey
          depends: "step-fetch-hello-hi"
          template: fetch-hello-hey
        - name: step-process-hello-hey
          depends: "step-fetch-hello-hey"
          template: process-hello-hey
        - name: step-prep-another-hey
          depends: "step-process-hello-hey"
          template: prep-another-hey

    - name: fetch-hello-hi
      container:
        image: nipuntest:1.0
        imagePullPolicy: IfNotPresent
        command: ["python", "sleepseconds.py", "2"]
 
    - name: fetch-hello-hey
      container:
        image: nipuntest:1.0
        imagePullPolicy: IfNotPresent
        command: ["python", "sleepseconds.py", "800"]
 
    - name: process-hello-hey
      container:
        image: nipuntest:1.0
        imagePullPolicy: IfNotPresent
        command: ["python", "sleepseconds.py", "2"]
 
    - name: prep-another-hey
      container:
        image: nipuntest:1.0
        imagePullPolicy: IfNotPresent
        command: ["python", "sleepseconds.py", "2"]

nipuntalukdar avatar Oct 18 '22 08:10 nipuntalukdar

Hello, I have the same problem. It worked correctly until yesterday and I do not know what has changed... What I see in the log of a node's wait pod is: time="2023-02-08T12:43:04.523Z" level=info msg="Creating a docker executor" Is that normal? I am using Argo Workflows 3.4.5, but I see that argoexec is 2.12.5, which is surely the problem. I will check what is wrong. Sorry, it was my fault: another controller pod was running version 2.12.5. A manipulation error somewhere...
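
A quick way to double-check which controller and executor images are actually in play (a sketch; resource and pod names may differ):

# image configured on the controller Deployment
kubectl -n argo get deploy workflow-controller -o yaml | grep -i 'image:'
# executor image injected into a workflow pod
kubectl -n argo get pod <workflow-pod-name> -o yaml | grep -i argoexec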

OlivierJavaux avatar Feb 07 '23 18:02 OlivierJavaux