
realtime metric continues to report a workflow as running after workflow succeeded

epDHowwD opened this issue 2 years ago · 5 comments

Pre-requisites

  • [X] I have double-checked my configuration
  • [ ] I can confirm the issue exists when I tested with :latest
  • [ ] I'd like to contribute the fix myself (see contributing guide)

What happened/what you expected to happen?

When exporting a realtime metric as follows from the workflow:

          - name: example_duration # argo_workflows_example_duration
            help: Duration of workflow
            labels:
              - key: status
                value: "{{status}}"
            gauge:
              realtime: true
              value: "{{duration}}"

the duration metric for the Pending status continues counting after the workflow has finished. I also see a second metric for the Succeeded status (see screenshot).

In a larger example I also saw a Running metric that continued counting; in this smaller reproduction the workflow probably transitions too quickly to observe that.

I would expect it to stop counting forward, i.e. the last reported value should be the duration at the final second the workflow was Running (or Pending). That it instead continues counting until I delete the controller pod looks like a bug to me.

Version

3.4.1

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: example
spec:
#   serviceAccountName: moses
  imagePullSecrets:
    - name: dockersecret
  entrypoint: example
  nodeSelector:
    hana-cloud.workload-class/default: "1"
  templates:
    - name: example
      script:
        image: bash
        command: ['sleep', '10']

      metrics:
        prometheus:
          # Metric to show success and failure, but not running or pending
          # Metric name (will be prepended with "argo_workflows_")
          - name: example_count # argo_workflows_example_count
            help: Finished executions
            labels:
              - key: workflow_name
                value: "{{workflow.name}}"
              - key: namespace
                value: "{{workflow.namespace}}"
              - key: status
                value: "{{status}}"
            counter:
              value: "1"
          - name: example_duration # argo_workflows_example_duration
            help: Duration of workflow
            labels:
              - key: status
                value: "{{status}}"
            gauge:
              realtime: true
              value: "{{duration}}"
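For completeness, one way to run this template, assuming the Argo CLI is installed and the manifest is saved as example-workflowtemplate.yaml (the filename is just an example):

kubectl apply -n default -f example-workflowtemplate.yaml
argo submit -n default --from workflowtemplate/example --watch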

Logs from the workflow controller

kubectl logs -n argo deploy/workflow-controller | grep ${workflow}

time="2023-02-10T16:14:15.851Z" level=info msg="Processing workflow" namespace=default workflow=example-5f7l6
time="2023-02-10T16:14:15.860Z" level=info msg="Updated phase  -> Running" namespace=default workflow=example-5f7l6
time="2023-02-10T16:14:15.860Z" level=info msg="Pod node example-5f7l6 initialized Pending" namespace=default workflow=example-5f7l6
time="2023-02-10T16:14:15.860Z" level=warning msg="'script.source' is empty, suggest change template into 'container'" namespace=default workflow=example-5f7l6
time="2023-02-10T16:14:15.945Z" level=info msg="Created pod: example-5f7l6 (example-5f7l6)" namespace=default workflow=example-5f7l6
time="2023-02-10T16:14:15.945Z" level=info msg="TaskSet Reconciliation" namespace=default workflow=example-5f7l6
time="2023-02-10T16:14:15.945Z" level=info msg=reconcileAgentPod namespace=default workflow=example-5f7l6
time="2023-02-10T16:14:15.951Z" level=info msg="Workflow update successful" namespace=default phase=Running resourceVersion=204674 workflow=example-5f7l6
time="2023-02-10T16:14:25.946Z" level=info msg="Processing workflow" namespace=default workflow=example-5f7l6
time="2023-02-10T16:14:25.947Z" level=info msg="Task-result reconciliation" namespace=default numObjs=0 workflow=example-5f7l6
time="2023-02-10T16:14:25.947Z" level=info msg="node changed" namespace=default new.message= new.phase=Running new.progress=0/1 nodeID=example-5f7l6 old.message= old.phase=Pending old.progress=0/1 workflow=example-5f7l6
time="2023-02-10T16:14:25.948Z" level=info msg="TaskSet Reconciliation" namespace=default workflow=example-5f7l6
time="2023-02-10T16:14:25.948Z" level=info msg=reconcileAgentPod namespace=default workflow=example-5f7l6
time="2023-02-10T16:14:25.954Z" level=info msg="Workflow update successful" namespace=default phase=Running resourceVersion=204739 workflow=example-5f7l6
time="2023-02-10T16:14:35.955Z" level=info msg="Processing workflow" namespace=default workflow=example-5f7l6
time="2023-02-10T16:14:35.956Z" level=info msg="Task-result reconciliation" namespace=default numObjs=0 workflow=example-5f7l6
time="2023-02-10T16:14:35.956Z" level=info msg="node changed" namespace=default new.message= new.phase=Succeeded new.progress=0/1 nodeID=example-5f7l6 old.message= old.phase=Running old.progress=0/1 workflow=example-5f7l6
time="2023-02-10T16:14:35.956Z" level=info msg="TaskSet Reconciliation" namespace=default workflow=example-5f7l6
time="2023-02-10T16:14:35.957Z" level=info msg=reconcileAgentPod namespace=default workflow=example-5f7l6
time="2023-02-10T16:14:35.957Z" level=info msg="Updated phase Running -> Succeeded" namespace=default workflow=example-5f7l6
time="2023-02-10T16:14:35.957Z" level=info msg="Marking workflow completed" namespace=default workflow=example-5f7l6
time="2023-02-10T16:14:35.957Z" level=info msg="Checking daemoned children of " namespace=default workflow=example-5f7l6
time="2023-02-10T16:14:35.962Z" level=info msg="cleaning up pod" action=deletePod key=default/example-5f7l6-1340600742-agent/deletePod
time="2023-02-10T16:14:35.963Z" level=info msg="Workflow update successful" namespace=default phase=Succeeded resourceVersion=204801 workflow=example-5f7l6
time="2023-02-10T16:14:35.964Z" level=info msg="Queueing Succeeded workflow default/example-5f7l6 for delete in 1h0m0s due to TTL"
time="2023-02-10T16:14:35.972Z" level=info msg="cleaning up pod" action=labelPodCompleted key=default/example-5f7l6/labelPodCompleted

Logs from your workflow's wait container

kubectl logs -c wait -l workflows.argoproj.io/workflow=example-5f7l6,workflow.argoproj.io/phase!=Succeeded

time="2023-02-10T16:14:18.258Z" level=info msg="Using executor retry strategy" Duration=1s Factor=1.6 Jitter=0.5 Steps=5
time="2023-02-10T16:14:18.258Z" level=info msg="Executor initialized" deadline="0001-01-01 00:00:00 +0000 UTC" includeScriptOutput=false namespace=default podName=example-5f7l6 template="{\"name\":\"example\",\"inputs\":{},\"outputs\":{},\"metadata\":{},\"script\":{\"name\":\"\",\"image\":\"bash\",\"command\":[\"sleep\",\"10\"],\"resources\":{},\"source\":\"\"},\"podSpecPatch\":\"{\\\"enableServiceLinks\\\": false}\\n\",\"metrics\":{\"prometheus\":[{\"name\":\"example_count\",\"labels\":[{\"key\":\"workflow_name\",\"value\":\"example-5f7l6\"},{\"key\":\"namespace\",\"value\":\"default\"},{\"key\":\"status\",\"value\":\"{{status}}\"}],\"help\":\"Finished executions\",\"counter\":{\"value\":\"1\"}},{\"name\":\"example_duration\",\"labels\":[{\"key\":\"status\",\"value\":\"{{status}}\"}],\"help\":\"Duration of workflow\",\"gauge\":{\"value\":\"{{duration}}\",\"realtime\":true}}]}}" version="&Version{Version:latest+unknown,BuildDate:2022-12-27T07:18:36Z,GitCommit:,GitTag:untagged,GitTreeState:clean,GoVersion:go1.18.9,Compiler:gc,Platform:linux/amd64,}"
time="2023-02-10T16:14:18.259Z" level=info msg="Starting deadline monitor"
time="2023-02-10T16:14:29.261Z" level=info msg="Main container completed" error="<nil>"
time="2023-02-10T16:14:29.261Z" level=info msg="No Script output reference in workflow. Capturing script output ignored"
time="2023-02-10T16:14:29.261Z" level=info msg="No output parameters"
time="2023-02-10T16:14:29.261Z" level=info msg="No output artifacts"
time="2023-02-10T16:14:29.261Z" level=info msg="Deadline monitor stopped"
time="2023-02-10T16:14:29.261Z" level=info msg="stopping progress monitor (context done)" error="context canceled"
time="2023-02-10T16:14:29.261Z" level=info msg="Alloc=6826 TotalAlloc=12257 Sys=19666 NumGC=4 Goroutines=7"

epDHowwD · Feb 10 '23

[screenshot]

epDHowwD · Mar 01 '23

You can set metricsTTL globally (https://github.com/argoproj/argo-workflows/blob/0b8724b5332a854f8e59a2aefa4343c54f1d1539/docs/workflow-controller-configmap.yaml#L252) so that the metric is cleared once the TTL has expired after the workflow has completed.
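For reference, a minimal sketch of that setting in the workflow-controller-configmap, assuming a default install in the argo namespace; the 10m value is only an illustration:

apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap
  namespace: argo
data:
  metricsConfig: |
    # Custom metrics are deleted this long after the owning workflow completes
    metricsTTL: "10m"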

If I convert your workflow example into a very simple DAG

spec:
  entrypoint: example
  templates:
    - name: example
      metrics:
        prometheus:
          # Metric to show success and failure, but not running or pending
          # Metric name (will be prepended with "argo_workflows_")
          - name: example_count # argo_workflows_example_count
            help: Finished executions
            labels:
              - key: workflow_name
                value: "{{workflow.name}}"
              - key: namespace
                value: "{{workflow.namespace}}"
              - key: status
                value: "{{status}}"
            counter:
              value: "1"
          - name: example_duration # argo_workflows_example_duration
            help: Duration of workflow
            labels:
              - key: status
                value: "{{status}}"
            gauge:
              realtime: true
              value: "{{duration}}"
      dag:
        tasks:
          - name: example1
            template: sleep
    - name: sleep
      script:
        image: bash
        command: ['sleep', '10']

I do instead get Running followed by Succeeded.

A slightly more complex DAG also behaves like the simpler case:

      dag:
        tasks:
          - name: example1
            template: sleep
          - name: example2
            template: sleep
          - name: example3
            depends: (example1 && example2)
            template: sleep

I'll investigate some more, but there seems to be something peculiar about your simple test case staying in Pending; without a metricsTTL the metrics appear to run on when they shouldn't.

Joibel · Mar 23 '23

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

stale[bot] · Sep 17 '23

A workaround is to expose only state-bound metrics, e.g. the duration for failed workflows only.

          - name: paul
            when: "{{status}} == Failed"
            gauge:
              realtime: true
              value: "{{duration}}"

But you then have to redesign your dashboards and alerts to handle this.
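A rough sketch of what that could look like for both terminal states, using the when field on the metric definitions (the names example_duration_failed and example_duration_succeeded are hypothetical, not from the original workflow):

      metrics:
        prometheus:
          - name: example_duration_failed
            help: Duration of failed workflows
            when: "{{status}} == Failed"
            gauge:
              realtime: true
              value: "{{duration}}"
          - name: example_duration_succeeded
            help: Duration of succeeded workflows
            when: "{{status}} == Succeeded"
            gauge:
              realtime: true
              value: "{{duration}}"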

epDHowwD · Sep 18 '23

This issue has been automatically marked as stale because it has not had recent activity and needs more information. It will be closed if no further activity occurs.

github-actions[bot] · Jan 21 '24

This issue has been closed due to inactivity and lack of information. If you still encounter this issue, please add the requested information and re-open.

github-actions[bot] · Feb 05 '24