argo-workflows
realtime metric continues to report a workflow as running after workflow succeeded
Pre-requisites
- [X] I have double-checked my configuration
- [ ] I can confirm the issue exists when I tested with `:latest`
- [ ] I'd like to contribute the fix myself (see contributing guide)
What happened/what you expected to happen?
When exporting a realtime metric as follows from the workflow:
- name: example_duration # argo_workflows_example_duration
  help: Duration of workflow
  labels:
    - key: status
      value: "{{status}}"
  gauge:
    realtime: true
    value: "{{duration}}"
the duration metric for the Pending status continues counting after the workflow is finished. I also see a second metric for the Succeeded state.
In a larger example I also saw a Running metric that continued counting, but in the smaller example provided here the workflow probably transitions too fast for that to show.
I would expect the metric to stop counting forward, i.e. its last report should be a Running (or Pending) workflow with the duration as of the last second the workflow was actually in that phase. That it instead keeps counting until I delete the controller pod looks like a bug to me.
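For anyone reproducing this, one way to watch the gauge directly is to port-forward the controller's Prometheus endpoint and grep it. This is a sketch, assuming the default metrics port 9090 and the `argo` namespace; adjust both to your installation. It is cluster-dependent, so it cannot run standalone:

```shell
# Port-forward the workflow-controller metrics port (default 9090)
kubectl -n argo port-forward deploy/workflow-controller 9090:9090 &

# Watch the realtime gauge; after the workflow has Succeeded, the
# series with status="Pending"/"Running" should stop increasing
watch -n 5 'curl -s http://localhost:9090/metrics | grep argo_workflows_example_duration'
```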
Version
3.4.1
Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: example
spec:
  # serviceAccountName: moses
  imagePullSecrets:
    - name: dockersecret
  entrypoint: example
  nodeSelector:
    hana-cloud.workload-class/default: "1"
  templates:
    - name: example
      script:
        image: bash
        command: ['sleep', '10']
      metrics:
        prometheus:
          # Metric to show success and failure, but not running or pending
          # Metric name (will be prepended with "argo_workflows_")
          - name: example_count # argo_workflows_example_count
            help: Finished executions
            labels:
              - key: workflow_name
                value: "{{workflow.name}}"
              - key: namespace
                value: "{{workflow.namespace}}"
              - key: status
                value: "{{status}}"
            counter:
              value: "1"
          - name: example_duration # argo_workflows_example_duration
            help: Duration of workflow
            labels:
              - key: status
                value: "{{status}}"
            gauge:
              realtime: true
              value: "{{duration}}"
Logs from the workflow controller
kubectl logs -n argo deploy/workflow-controller | grep ${workflow}
time="2023-02-10T16:14:15.851Z" level=info msg="Processing workflow" namespace=default workflow=example-5f7l6
time="2023-02-10T16:14:15.860Z" level=info msg="Updated phase -> Running" namespace=default workflow=example-5f7l6
time="2023-02-10T16:14:15.860Z" level=info msg="Pod node example-5f7l6 initialized Pending" namespace=default workflow=example-5f7l6
time="2023-02-10T16:14:15.860Z" level=warning msg="'script.source' is empty, suggest change template into 'container'" namespace=default workflow=example-5f7l6
time="2023-02-10T16:14:15.945Z" level=info msg="Created pod: example-5f7l6 (example-5f7l6)" namespace=default workflow=example-5f7l6
time="2023-02-10T16:14:15.945Z" level=info msg="TaskSet Reconciliation" namespace=default workflow=example-5f7l6
time="2023-02-10T16:14:15.945Z" level=info msg=reconcileAgentPod namespace=default workflow=example-5f7l6
time="2023-02-10T16:14:15.951Z" level=info msg="Workflow update successful" namespace=default phase=Running resourceVersion=204674 workflow=example-5f7l6
time="2023-02-10T16:14:25.946Z" level=info msg="Processing workflow" namespace=default workflow=example-5f7l6
time="2023-02-10T16:14:25.947Z" level=info msg="Task-result reconciliation" namespace=default numObjs=0 workflow=example-5f7l6
time="2023-02-10T16:14:25.947Z" level=info msg="node changed" namespace=default new.message= new.phase=Running new.progress=0/1 nodeID=example-5f7l6 old.message= old.phase=Pending old.progress=0/1 workflow=example-5f7l6
time="2023-02-10T16:14:25.948Z" level=info msg="TaskSet Reconciliation" namespace=default workflow=example-5f7l6
time="2023-02-10T16:14:25.948Z" level=info msg=reconcileAgentPod namespace=default workflow=example-5f7l6
time="2023-02-10T16:14:25.954Z" level=info msg="Workflow update successful" namespace=default phase=Running resourceVersion=204739 workflow=example-5f7l6
time="2023-02-10T16:14:35.955Z" level=info msg="Processing workflow" namespace=default workflow=example-5f7l6
time="2023-02-10T16:14:35.956Z" level=info msg="Task-result reconciliation" namespace=default numObjs=0 workflow=example-5f7l6
time="2023-02-10T16:14:35.956Z" level=info msg="node changed" namespace=default new.message= new.phase=Succeeded new.progress=0/1 nodeID=example-5f7l6 old.message= old.phase=Running old.progress=0/1 workflow=example-5f7l6
time="2023-02-10T16:14:35.956Z" level=info msg="TaskSet Reconciliation" namespace=default workflow=example-5f7l6
time="2023-02-10T16:14:35.957Z" level=info msg=reconcileAgentPod namespace=default workflow=example-5f7l6
time="2023-02-10T16:14:35.957Z" level=info msg="Updated phase Running -> Succeeded" namespace=default workflow=example-5f7l6
time="2023-02-10T16:14:35.957Z" level=info msg="Marking workflow completed" namespace=default workflow=example-5f7l6
time="2023-02-10T16:14:35.957Z" level=info msg="Checking daemoned children of " namespace=default workflow=example-5f7l6
time="2023-02-10T16:14:35.962Z" level=info msg="cleaning up pod" action=deletePod key=default/example-5f7l6-1340600742-agent/deletePod
time="2023-02-10T16:14:35.963Z" level=info msg="Workflow update successful" namespace=default phase=Succeeded resourceVersion=204801 workflow=example-5f7l6
time="2023-02-10T16:14:35.964Z" level=info msg="Queueing Succeeded workflow default/example-5f7l6 for delete in 1h0m0s due to TTL"
time="2023-02-10T16:14:35.972Z" level=info msg="cleaning up pod" action=labelPodCompleted key=default/example-5f7l6/labelPodCompleted
Logs from your workflow's wait container
kubectl logs -c wait -l workflows.argoproj.io/workflow=example-5f7l6,workflow.argoproj.io/phase!=Succeeded
time="2023-02-10T16:14:18.258Z" level=info msg="Using executor retry strategy" Duration=1s Factor=1.6 Jitter=0.5 Steps=5
time="2023-02-10T16:14:18.258Z" level=info msg="Executor initialized" deadline="0001-01-01 00:00:00 +0000 UTC" includeScriptOutput=false namespace=default podName=example-5f7l6 template="{\"name\":\"example\",\"inputs\":{},\"outputs\":{},\"metadata\":{},\"script\":{\"name\":\"\",\"image\":\"bash\",\"command\":[\"sleep\",\"10\"],\"resources\":{},\"source\":\"\"},\"podSpecPatch\":\"{\\\"enableServiceLinks\\\": false}\\n\",\"metrics\":{\"prometheus\":[{\"name\":\"example_count\",\"labels\":[{\"key\":\"workflow_name\",\"value\":\"example-5f7l6\"},{\"key\":\"namespace\",\"value\":\"default\"},{\"key\":\"status\",\"value\":\"{{status}}\"}],\"help\":\"Finished executions\",\"counter\":{\"value\":\"1\"}},{\"name\":\"example_duration\",\"labels\":[{\"key\":\"status\",\"value\":\"{{status}}\"}],\"help\":\"Duration of workflow\",\"gauge\":{\"value\":\"{{duration}}\",\"realtime\":true}}]}}" version="&Version{Version:latest+unknown,BuildDate:2022-12-27T07:18:36Z,GitCommit:,GitTag:untagged,GitTreeState:clean,GoVersion:go1.18.9,Compiler:gc,Platform:linux/amd64,}"
time="2023-02-10T16:14:18.259Z" level=info msg="Starting deadline monitor"
time="2023-02-10T16:14:29.261Z" level=info msg="Main container completed" error="<nil>"
time="2023-02-10T16:14:29.261Z" level=info msg="No Script output reference in workflow. Capturing script output ignored"
time="2023-02-10T16:14:29.261Z" level=info msg="No output parameters"
time="2023-02-10T16:14:29.261Z" level=info msg="No output artifacts"
time="2023-02-10T16:14:29.261Z" level=info msg="Deadline monitor stopped"
time="2023-02-10T16:14:29.261Z" level=info msg="stopping progress monitor (context done)" error="context canceled"
time="2023-02-10T16:14:29.261Z" level=info msg="Alloc=6826 TotalAlloc=12257 Sys=19666 NumGC=4 Goroutines=7"
You can enable metricsTTL globally (https://github.com/argoproj/argo-workflows/blob/0b8724b5332a854f8e59a2aefa4343c54f1d1539/docs/workflow-controller-configmap.yaml#L252) so that the stale metric is cleared once the TTL has expired after the workflow has completed.
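For reference, a minimal sketch of that setting in the workflow-controller ConfigMap (the 10-minute TTL and the `argo` namespace are example values; tune them for your installation):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap
  namespace: argo
data:
  metricsConfig: |
    # Custom metrics are flushed once this TTL elapses
    metricsTTL: "10m"
```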
If I convert your workflow example into a very simple DAG
spec:
  entrypoint: example
  templates:
    - name: example
      metrics:
        prometheus:
          # Metric to show success and failure, but not running or pending
          # Metric name (will be prepended with "argo_workflows_")
          - name: example_count # argo_workflows_example_count
            help: Finished executions
            labels:
              - key: workflow_name
                value: "{{workflow.name}}"
              - key: namespace
                value: "{{workflow.namespace}}"
              - key: status
                value: "{{status}}"
            counter:
              value: "1"
          - name: example_duration # argo_workflows_example_duration
            help: Duration of workflow
            labels:
              - key: status
                value: "{{status}}"
            gauge:
              realtime: true
              value: "{{duration}}"
      dag:
        tasks:
          - name: example1
            template: sleep
    - name: sleep
      script:
        image: bash
        command: ['sleep', '10']
I instead get Running followed by Succeeded.
And with a slightly more complex DAG this also behaves correctly, as in the simpler case:
dag:
  tasks:
    - name: example1
      template: sleep
    - name: example2
      template: sleep
    - name: example3
      depends: (example1 && example2)
      template: sleep
I'll investigate some more, but there seems to be something peculiar about your simple test case staying in Pending: without a metricsTTL, the metrics keep running on when it appears they shouldn't.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
A workaround is also straightforward: expose only state-bound metrics, e.g. a duration metric for failed workflows only.

- name: paul
  when: "{{status}} == Failed"
  gauge:
    realtime: true
    value: "{{duration}}"

But you then have to redesign your dashboards and alerts to handle this.
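For the dashboard side, hypothetical PromQL along these lines scopes queries to terminal phases; the metric and label names here follow the earlier examples in this thread, not a fixed convention:

```promql
# Duration of failed workflows only (matches the `when` guard above)
argo_workflows_example_duration{status="Failed"}

# Counts of completed runs, excluding in-flight phases
sum by (status) (argo_workflows_example_count{status=~"Succeeded|Failed"})
```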
This issue has been automatically marked as stale because it has not had recent activity and needs more information. It will be closed if no further activity occurs.
This issue has been closed due to inactivity and lack of information. If you still encounter this issue, please add the requested information and re-open.