
realtime metric continues to report a workflow as running after workflow succeeded

epDHowwD opened this issue 2 years ago · 5 comments

Pre-requisites

  • [X] I have double-checked my configuration
  • [ ] I can confirm the issue exists when I tested with :latest
  • [ ] I'd like to contribute the fix myself (see contributing guide)

What happened/what you expected to happen?

When exporting a realtime metric as follows from the workflow:

          - name: example_duration # argo_workflows_example_duration
            help: Duration of workflow
            labels:
              - key: status
                value: "{{status}}"
            gauge:
              realtime: true
              value: "{{duration}}"

the duration metric for the Pending status continues counting after the workflow has finished. I also see a second metric for the Succeeded status (see screenshot).

In a larger example I also saw a Running metric that continued counting; in this smaller reproduction the workflow probably transitions too quickly to observe that.

I would expect it to stop counting forward, i.e. the last reported value should be the duration at the final second the workflow was Running (or Pending). That it instead continues counting until I delete the controller pod looks like a bug to me.

Version

3.4.1

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: example
spec:
#   serviceAccountName: moses
  imagePullSecrets:
    - name: dockersecret
  entrypoint: example
  nodeSelector:
    hana-cloud.workload-class/default: "1"
  templates:
    - name: example
      script:
        image: bash
        command: ['sleep', '10']

      metrics:
        prometheus:
          # Metric to show success and failure, but not running or pending
          # Metric name (will be prepended with "argo_workflows_")
          - name: example_count # argo_workflows_example_count
            help: Finished executions
            labels:
              - key: workflow_name
                value: "{{workflow.name}}"
              - key: namespace
                value: "{{workflow.namespace}}"
              - key: status
                value: "{{status}}"
            counter:
              value: "1"
          - name: example_duration # argo_workflows_example_duration
            help: Duration of workflow
            labels:
              - key: status
                value: "{{status}}"
            gauge:
              realtime: true
              value: "{{duration}}"
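For completeness, one way to run this template, assuming the Argo CLI is installed and the manifest is saved as example-workflowtemplate.yaml (the filename is just an example):

kubectl apply -n default -f example-workflowtemplate.yaml
argo submit -n default --from workflowtemplate/example --watch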

Logs from the workflow controller

kubectl logs -n argo deploy/workflow-controller | grep ${workflow}

time="2023-02-10T16:14:15.851Z" level=info msg="Processing workflow" namespace=default workflow=example-5f7l6
time="2023-02-10T16:14:15.860Z" level=info msg="Updated phase  -> Running" namespace=default workflow=example-5f7l6
time="2023-02-10T16:14:15.860Z" level=info msg="Pod node example-5f7l6 initialized Pending" namespace=default workflow=example-5f7l6
time="2023-02-10T16:14:15.860Z" level=warning msg="'script.source' is empty, suggest change template into 'container'" namespace=default workflow=example-5f7l6
time="2023-02-10T16:14:15.945Z" level=info msg="Created pod: example-5f7l6 (example-5f7l6)" namespace=default workflow=example-5f7l6
time="2023-02-10T16:14:15.945Z" level=info msg="TaskSet Reconciliation" namespace=default workflow=example-5f7l6
time="2023-02-10T16:14:15.945Z" level=info msg=reconcileAgentPod namespace=default workflow=example-5f7l6
time="2023-02-10T16:14:15.951Z" level=info msg="Workflow update successful" namespace=default phase=Running resourceVersion=204674 workflow=example-5f7l6
time="2023-02-10T16:14:25.946Z" level=info msg="Processing workflow" namespace=default workflow=example-5f7l6
time="2023-02-10T16:14:25.947Z" level=info msg="Task-result reconciliation" namespace=default numObjs=0 workflow=example-5f7l6
time="2023-02-10T16:14:25.947Z" level=info msg="node changed" namespace=default new.message= new.phase=Running new.progress=0/1 nodeID=example-5f7l6 old.message= old.phase=Pending old.progress=0/1 workflow=example-5f7l6
time="2023-02-10T16:14:25.948Z" level=info msg="TaskSet Reconciliation" namespace=default workflow=example-5f7l6
time="2023-02-10T16:14:25.948Z" level=info msg=reconcileAgentPod namespace=default workflow=example-5f7l6
time="2023-02-10T16:14:25.954Z" level=info msg="Workflow update successful" namespace=default phase=Running resourceVersion=204739 workflow=example-5f7l6
time="2023-02-10T16:14:35.955Z" level=info msg="Processing workflow" namespace=default workflow=example-5f7l6
time="2023-02-10T16:14:35.956Z" level=info msg="Task-result reconciliation" namespace=default numObjs=0 workflow=example-5f7l6
time="2023-02-10T16:14:35.956Z" level=info msg="node changed" namespace=default new.message= new.phase=Succeeded new.progress=0/1 nodeID=example-5f7l6 old.message= old.phase=Running old.progress=0/1 workflow=example-5f7l6
time="2023-02-10T16:14:35.956Z" level=info msg="TaskSet Reconciliation" namespace=default workflow=example-5f7l6
time="2023-02-10T16:14:35.957Z" level=info msg=reconcileAgentPod namespace=default workflow=example-5f7l6
time="2023-02-10T16:14:35.957Z" level=info msg="Updated phase Running -> Succeeded" namespace=default workflow=example-5f7l6
time="2023-02-10T16:14:35.957Z" level=info msg="Marking workflow completed" namespace=default workflow=example-5f7l6
time="2023-02-10T16:14:35.957Z" level=info msg="Checking daemoned children of " namespace=default workflow=example-5f7l6
time="2023-02-10T16:14:35.962Z" level=info msg="cleaning up pod" action=deletePod key=default/example-5f7l6-1340600742-agent/deletePod
time="2023-02-10T16:14:35.963Z" level=info msg="Workflow update successful" namespace=default phase=Succeeded resourceVersion=204801 workflow=example-5f7l6
time="2023-02-10T16:14:35.964Z" level=info msg="Queueing Succeeded workflow default/example-5f7l6 for delete in 1h0m0s due to TTL"
time="2023-02-10T16:14:35.972Z" level=info msg="cleaning up pod" action=labelPodCompleted key=default/example-5f7l6/labelPodCompleted

Logs from your workflow's wait container

kubectl logs -c wait -l workflows.argoproj.io/workflow=example-5f7l6,workflow.argoproj.io/phase!=Succeeded

time="2023-02-10T16:14:18.258Z" level=info msg="Using executor retry strategy" Duration=1s Factor=1.6 Jitter=0.5 Steps=5
time="2023-02-10T16:14:18.258Z" level=info msg="Executor initialized" deadline="0001-01-01 00:00:00 +0000 UTC" includeScriptOutput=false namespace=default podName=example-5f7l6 template="{\"name\":\"example\",\"inputs\":{},\"outputs\":{},\"metadata\":{},\"script\":{\"name\":\"\",\"image\":\"bash\",\"command\":[\"sleep\",\"10\"],\"resources\":{},\"source\":\"\"},\"podSpecPatch\":\"{\\\"enableServiceLinks\\\": false}\\n\",\"metrics\":{\"prometheus\":[{\"name\":\"example_count\",\"labels\":[{\"key\":\"workflow_name\",\"value\":\"example-5f7l6\"},{\"key\":\"namespace\",\"value\":\"default\"},{\"key\":\"status\",\"value\":\"{{status}}\"}],\"help\":\"Finished executions\",\"counter\":{\"value\":\"1\"}},{\"name\":\"example_duration\",\"labels\":[{\"key\":\"status\",\"value\":\"{{status}}\"}],\"help\":\"Duration of workflow\",\"gauge\":{\"value\":\"{{duration}}\",\"realtime\":true}}]}}" version="&Version{Version:latest+unknown,BuildDate:2022-12-27T07:18:36Z,GitCommit:,GitTag:untagged,GitTreeState:clean,GoVersion:go1.18.9,Compiler:gc,Platform:linux/amd64,}"
time="2023-02-10T16:14:18.259Z" level=info msg="Starting deadline monitor"
time="2023-02-10T16:14:29.261Z" level=info msg="Main container completed" error="<nil>"
time="2023-02-10T16:14:29.261Z" level=info msg="No Script output reference in workflow. Capturing script output ignored"
time="2023-02-10T16:14:29.261Z" level=info msg="No output parameters"
time="2023-02-10T16:14:29.261Z" level=info msg="No output artifacts"
time="2023-02-10T16:14:29.261Z" level=info msg="Deadline monitor stopped"
time="2023-02-10T16:14:29.261Z" level=info msg="stopping progress monitor (context done)" error="context canceled"
time="2023-02-10T16:14:29.261Z" level=info msg="Alloc=6826 TotalAlloc=12257 Sys=19666 NumGC=4 Goroutines=7"

epDHowwD · Feb 10 '23

[screenshot]

epDHowwD · Mar 01 '23

You can set metricsTTL globally (https://github.com/argoproj/argo-workflows/blob/0b8724b5332a854f8e59a2aefa4343c54f1d1539/docs/workflow-controller-configmap.yaml#L252) so that the metric is cleared once the TTL has expired after the workflow has completed.
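For reference, a minimal sketch of that setting in the workflow-controller-configmap, assuming a default install in the argo namespace; the 10m value is only an illustration:

apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap
  namespace: argo
data:
  metricsConfig: |
    # Custom metrics are deleted this long after the owning workflow completes
    metricsTTL: "10m"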

If I convert your workflow example into a very simple DAG

spec:
  entrypoint: example
  templates:
    - name: example
      metrics:
        prometheus:
          # Metric to show success and failure, but not running or pending
          # Metric name (will be prepended with "argo_workflows_")
          - name: example_count # argo_workflows_example_count
            help: Finished executions
            labels:
              - key: workflow_name
                value: "{{workflow.name}}"
              - key: namespace
                value: "{{workflow.namespace}}"
              - key: status
                value: "{{status}}"
            counter:
              value: "1"
          - name: example_duration # argo_workflows_example_duration
            help: Duration of workflow
            labels:
              - key: status
                value: "{{status}}"
            gauge:
              realtime: true
              value: "{{duration}}"
      dag:
        tasks:
          - name: example1
            template: sleep
    - name: sleep
      script:
        image: bash
        command: ['sleep', '10']

I do instead get Running followed by Succeeded.

A slightly more complex DAG also behaves like the simpler case:

      dag:
        tasks:
          - name: example1
            template: sleep
          - name: example2
            template: sleep
          - name: example3
            depends: (example1 && example2)
            template: sleep

I'll investigate some more, but there seems to be something peculiar about your simple test case staying in Pending; without a metricsTTL the metrics appear to run on when they shouldn't.

Joibel · Mar 23 '23

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

stale[bot] · Sep 17 '23

A workaround is to expose only state-bound metrics, e.g. the duration for failed workflows only.

          - name: paul
            when: "{{status}} == Failed"
            gauge:
              realtime: true
              value: "{{duration}}"

But you then have to redesign your dashboards and alerts to handle this.
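A rough sketch of what that could look like for both terminal states, using the when field on the metric definitions (the names example_duration_failed and example_duration_succeeded are hypothetical, not from the original workflow):

      metrics:
        prometheus:
          - name: example_duration_failed
            help: Duration of failed workflows
            when: "{{status}} == Failed"
            gauge:
              realtime: true
              value: "{{duration}}"
          - name: example_duration_succeeded
            help: Duration of succeeded workflows
            when: "{{status}} == Succeeded"
            gauge:
              realtime: true
              value: "{{duration}}"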

epDHowwD · Sep 18 '23

This issue has been automatically marked as stale because it has not had recent activity and needs more information. It will be closed if no further activity occurs.

github-actions[bot] · Jan 21 '24

This issue has been closed due to inactivity and lack of information. If you still encounter this issue, please add the requested information and re-open.

github-actions[bot] · Feb 05 '24