
Daemon container displayed as Failed

Open mbouillaud opened this issue 4 months ago • 11 comments

Pre-requisites

  • [x] I have double-checked my configuration
  • [x] I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.
  • [x] I have searched existing issues and could not find a match for this bug
  • [x] I'd like to contribute the fix myself (see contributing guide)

What happened? What did you expect to happen?

When using a daemon container in a fully succeeded Workflow, the daemon containers are displayed as Failed in the UI even though they exited with code 0:

[screenshot: daemon containers shown as Failed in the UI with exit code 0]

Daemon container definition (probe timing fields moved up a level: `initialDelaySeconds`, `periodSeconds`, and `timeoutSeconds` belong to `readinessProbe`, not to `exec`):

container:
  readinessProbe:
    exec:
      command:
        - redis-cli
        - ping
    initialDelaySeconds: 2
    periodSeconds: 5
    timeoutSeconds: 5
  name: redis
  image: eu.gcr.io/myrepo/base/redis:6.0.16
  imagePullPolicy: Always
  ports:
    - containerPort: 6379

Pod status in the UI:

[screenshot: pod status in the UI]

Redis daemon container logs:

29:signal-handler (1756156059) Received SIGTERM scheduling shutdown...
29:M 25 Aug 2025 21:07:39.423 # User requested shutdown...
29:M 25 Aug 2025 21:07:39.423 * Saving the final RDB snapshot before exiting.
29:M 25 Aug 2025 21:07:39.459 * DB saved on disk
29:M 25 Aug 2025 21:07:39.459 # Redis is now ready to exit, bye bye...
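
The logs above show a clean SIGTERM shutdown. To confirm the container really terminated with exit code 0 (rather than relying on the UI), the pod's `containerStatuses` can be inspected directly; the pod name below is a placeholder for the actual daemon pod, and this of course requires access to the cluster the workflow ran on:

```shell
# Print the termination exit code of the "redis" container in the daemon pod.
# <daemon-pod-name> is a placeholder; substitute the real pod name.
kubectl get pod <daemon-pod-name> -n argo-workflows \
  -o jsonpath='{.status.containerStatuses[?(@.name=="redis")].state.terminated.exitCode}'
```

If this prints `0` while the UI shows Failed, the discrepancy is in how the controller records the daemon node's phase, not in the container itself.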

Version(s)

v3.7.1

Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

ClusterWorkflowTemplate for the redis daemon:

apiVersion: argoproj.io/v1alpha1
kind: ClusterWorkflowTemplate
metadata:
  name: daemon-redis
spec:
  entrypoint: daemon-redis
  templates:
    - name: daemon-redis
      daemon: true
      terminationGracePeriodSeconds: 5
      container:
        readinessProbe:
          exec:
            command:
              - redis-cli
              - ping
          initialDelaySeconds: 2
          periodSeconds: 5
          timeoutSeconds: 5
        name: redis
        image: redis:6.0.16
        imagePullPolicy: Always
        ports:
          - containerPort: 6379

WorkflowTemplate that uses it:

---
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: test-workflow
  namespace: argo-workflows
spec:
  entrypoint: main
  serviceAccountName: argo-workflows-server
  templates:
    - name: main
      dag:
        tasks:
          - name: redis
            templateRef:
              name: daemon-redis
              template: daemon-redis
              clusterScope: true
          - name: mongodb
            templateRef:
              name: daemon-mongodb
              template: daemon-mongodb
              clusterScope: true
          - name: tests
            depends: redis && mongodb
            arguments:
              parameters:
                - name: composer_command
                  value: test
            templateRef:
              name: composer
              template: composer
              clusterScope: true
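
For reference, here is the same shape collapsed into one self-contained Workflow using only public images. This is an illustrative sketch for reproduction attempts, not taken verbatim from the report: the `mongodb` daemon and the `composer` test template are replaced by a single `redis-cli ping` step.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: daemon-redis-repro-
spec:
  entrypoint: main
  templates:
    - name: main
      dag:
        tasks:
          - name: redis
            template: daemon-redis
          - name: tests
            depends: redis
            template: ping
            arguments:
              parameters:
                - name: ip
                  value: "{{tasks.redis.ip}}"
    # Inlined version of the daemon-redis ClusterWorkflowTemplate above.
    - name: daemon-redis
      daemon: true
      container:
        name: redis
        image: redis:6.0.16
        ports:
          - containerPort: 6379
        readinessProbe:
          exec:
            command: [redis-cli, ping]
          initialDelaySeconds: 2
          periodSeconds: 5
    # Stand-in for the original private "composer" test template.
    - name: ping
      inputs:
        parameters:
          - name: ip
      container:
        image: redis:6.0.16
        command: [redis-cli, -h, "{{inputs.parameters.ip}}", ping]
```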

Logs from the workflow controller

❯ k logs argo-workflows-exploitation-workflow-controller-6795f478f-fcmcq | grep hhmw5
time="2025-08-26T07:37:02.220Z" level=info msg="Queueing Succeeded workflow argo-workflows/jenkins-test-phpfpm-pr-10-hhmw5 for delete in 4309h30m47s due to TTL"

Logs from your workflow's wait container

No failed containers in my workflow.

mbouillaud avatar Aug 26 '25 07:08 mbouillaud

@mbouillaud please post a reproducible example; the current one uses templates and private images.

eduardodbr avatar Aug 26 '25 10:08 eduardodbr

I've updated my initial post @eduardodbr ;) Sorry

mbouillaud avatar Aug 26 '25 12:08 mbouillaud

To add more detail: the daemon steps stay green until the `finish` hook succeeds:

[screenshot: daemon steps shown green while the workflow is running]

Once the `finish` hook succeeds, the daemon steps are marked Failed.

mbouillaud avatar Aug 29 '25 14:08 mbouillaud

Record of pod states during the Workflow: https://share.cleanshot.com/L8hlt1ws

mbouillaud avatar Aug 29 '25 14:08 mbouillaud

We've got the same issue after upgrading to version 0.45.25 of the Argo Workflows Helm chart, which uses image version v3.7.2. The daemon workflow pods are marked as Failed even though the other pods are successful.

Also, I should mention that our Workflows use templateRef for those daemon tasks in the workflow dag.

OrushT avatar Sep 18 '25 06:09 OrushT

Is there any confirmation that this bug will be fixed in the next patch release of 3.7.x?

Charliefrodriguez avatar Oct 07 '25 12:10 Charliefrodriguez

plan to take a look on the weekend

tczhao avatar Oct 08 '25 03:10 tczhao

@mbouillaud @Charliefrodriguez

Could you help us generate a reproducible example? I tried with the following, but couldn't reproduce the issue

# This example demonstrates daemoned steps used in DAG templates. It is equivalent to the
# daemon-step.yaml example, but written in DAG format. The IP address of the daemoned step can be
# referenced using the '{{tasks.taskname.ip}}' variable.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: dag-daemon-task-
spec:
  hooks:
    finish:
      template: http
      expression: workflow.status == "Succeeded"
    running:
      expression: workflow.status == "Running"
      template: http
  entrypoint: daemon-example
  templates:
  - name: daemon-example
    dag:
      tasks:
      - name: influx
        template: influxdb

      - name: init-database
        template: influxdb-client
        depends: "influx"
        arguments:
          parameters:
          - name: cmd
            value: curl -XPOST 'http://{{tasks.influx.ip}}:8086/query' --data-urlencode "q=CREATE DATABASE mydb"

      - name: producer-1
        template: influxdb-client
        depends: "init-database"
        arguments:
          parameters:
          - name: cmd
            value: for i in $(seq 1 20); do curl -XPOST 'http://{{tasks.influx.ip}}:8086/write?db=mydb' -d "cpu,host=server01,region=uswest load=$i" ; sleep .5 ; done
      - name: producer-2
        template: influxdb-client
        depends: "init-database"
        arguments:
          parameters:
          - name: cmd
            value: for i in $(seq 1 20); do curl -XPOST 'http://{{tasks.influx.ip}}:8086/write?db=mydb' -d "cpu,host=server02,region=uswest load=$((RANDOM % 100))" ; sleep .5 ; done
      - name: producer-3
        template: influxdb-client
        depends: "init-database"
        arguments:
          parameters:
          - name: cmd
            value: curl -XPOST 'http://{{tasks.influx.ip}}:8086/write?db=mydb' -d 'cpu,host=server03,region=useast load=15.4'

      - name: consumer
        template: influxdb-client
        depends: "producer-1 && producer-2 && producer-3"
        arguments:
          parameters:
          - name: cmd
            value: curl --silent -G http://{{tasks.influx.ip}}:8086/query?pretty=true --data-urlencode "db=mydb" --data-urlencode "q=SELECT * FROM cpu"

  - name: influxdb
    daemon: true
    container:
      image: influxdb:1.2
      readinessProbe:
        httpGet:
          path: /ping
          port: 8086
        initialDelaySeconds: 5
        timeoutSeconds: 1

  - name: influxdb-client
    inputs:
      parameters:
      - name: cmd
    container:
      image: appropriate/curl:latest
      command: ["sh", "-c"]
      args: ["{{inputs.parameters.cmd}}"]

  - name: http
    http:
      url: "https://raw.githubusercontent.com/argoproj/argo-workflows/4e450e250168e6b4d51a126b784e90b11a0162bc/pkg/apis/workflow/v1alpha1/generated.swagger.json"

tczhao avatar Oct 12 '25 10:10 tczhao

This issue has been automatically marked as stale because it has not had recent activity and needs more information. It will be closed if no further activity occurs.

github-actions[bot] avatar Oct 27 '25 02:10 github-actions[bot]

This issue has been closed due to inactivity and lack of information. If you still encounter this issue, please add the requested information and re-open.

github-actions[bot] avatar Nov 11 '25 02:11 github-actions[bot]

@tczhao Can this be reopened? We're still experiencing this issue on the current version of Argo workflows

zachabney avatar Nov 13 '25 19:11 zachabney