argo-workflows

Ability to report pod-level success/failure as a metric for steps with retries

Open danxmoran opened this issue 5 years ago • 2 comments

Summary

Based on the variables docs, it looks like it's possible to use the overall status of a step to generate metrics. When a step is retried multiple times, though, I don't see a way to access or report the counts of pod-level statuses (e.g. 3 failures and 1 success).

Motivation

Our ETL workflows interact with external systems that we know are flaky, so we've slapped retries on nearly every step. Our workflows now succeed, but we've lost the ability to clearly see which steps are most volatile over time. Tracking the pod-level success counts for each step would help us regain that observability.

Proposal

The success count of a retry group is always 1 (unless something weird happens), so I think the only missing piece is an Argo variable containing the number of failed attempts in a retry group. If we had that, we could use it to increment a Prometheus counter.
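
As a rough sketch of what the proposal could look like in a workflow spec: the template below attaches a Prometheus counter to a retried step and feeds it from the requested variable. The variable name `retryFailures` is a hypothetical placeholder and is not an existing Argo variable; the workflow, template, image, and metric names are also made up for illustration. Only the general shape of `retryStrategy` and `metrics.prometheus` follows the documented Argo Workflows spec.

```yaml
# Sketch only: {{retryFailures}} is a hypothetical placeholder for the
# proposed variable. It does not exist in Argo, so this manifest will not
# resolve as written.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: flaky-etl-            # placeholder name
spec:
  entrypoint: extract
  templates:
    - name: extract
      retryStrategy:
        limit: "3"                    # retry the step up to 3 times
      metrics:
        prometheus:
          - name: step_failed_attempts_total   # placeholder metric name
            help: "Failed pod attempts per step"
            labels:
              - key: step
                value: extract
            counter:
              # Hypothetical: number of failed attempts in this retry group
              value: "{{retryFailures}}"
      container:
        image: example/etl:latest              # placeholder image
        command: [sh, -c, "./extract.sh"]      # placeholder command
```

With something along these lines, a step that fails three times and then succeeds would add 3 to the counter, which would make the most volatile steps visible over time even though the workflow as a whole succeeds.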


Message from the maintainers:

If you wish to see this enhancement implemented please add a 👍 reaction to this issue! We often sort issues this way to know what to prioritize.

danxmoran Apr 23 '20 19:04

Agreed, this is related to #2898. I have a PR ready, so if everyone agrees we can work through the details and get it done.

seddonm1 Apr 30 '20 04:04

@seddonm1 Feel free to open your PR with retryAttempt

simster7 Apr 30 '20 15:04

It looks like this was completed in #2911

agilgur5 Oct 18 '24 07:10