Ability to report pod-level success/failure as a metric for steps with retries
Summary
Based on the variables docs, it looks like it's possible to use the overall status of a step to generate metrics. When a step includes retries, however, I don't see a way to access or report the counts of pod-level statuses (e.g. 3 failures and 1 success).
Motivation
Our ETL workflows interact with external systems that we know are flaky, so we've slapped retries on nearly every step. Our workflows now succeed, but we've lost the ability to clearly see which steps are most volatile over time. Tracking the pod-level success counts for each step would help us regain that observability.
Proposal
The success count of a retry group is always 1 (unless something weird happens), so I think the only missing piece is an Argo variable containing the number of failed attempts in a retry group. If we had that, we could use it to increment a Prometheus counter.
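To make the arithmetic concrete, here is a minimal Python sketch of the bookkeeping described above. It is not Argo code: the function name `record_failed_attempts`, the plain dict standing in for a Prometheus counter, and the status strings `"Succeeded"`, `"Failed"`, and `"Error"` are all assumptions made for illustration.

```python
# Hypothetical sketch: tally failed pod-level attempts per step.
# A plain dict stands in for a Prometheus counter; the status strings
# are assumed, not taken from the Argo API.

FAILURE_STATUSES = {"Failed", "Error"}


def record_failed_attempts(counter, step_name, attempt_statuses):
    """Add the number of failed attempts in one retry group to counter[step_name].

    Returns the number of failed attempts found in this retry group.
    """
    failed = sum(1 for status in attempt_statuses if status in FAILURE_STATUSES)
    # Counters only ever increase, so we add to the running total.
    counter[step_name] = counter.get(step_name, 0) + failed
    return failed


if __name__ == "__main__":
    failures_by_step = {}
    # One retry group: three failed pods, then one that succeeded.
    record_failed_attempts(
        failures_by_step, "extract", ["Failed", "Failed", "Failed", "Succeeded"]
    )
    # A later run of the same step that succeeded on the first try.
    record_failed_attempts(failures_by_step, "extract", ["Succeeded"])
    print(failures_by_step)
```

With a failed-attempt count exposed per retry group, the running total per step is what would show which steps are most volatile over time, even when every workflow ultimately succeeds.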
Message from the maintainers:
If you wish to see this enhancement implemented please add a 👍 reaction to this issue! We often sort issues this way to know what to prioritize.
Agreed, this is related to #2898. I have a PR ready, so if everyone agrees we can work through the details and get it done.
@seddonm1 Feel free to open your PR with retryAttempt
It looks like this was completed in #2911