Expose paused and retired workers separately in Prometheus
Closes #xxxx
- [x] Tests added / passed
- [x] Passes `pre-commit run --all-files`
cc @ntabris for the Grafana dashboards
Having paused, retiring, and paused_or_retiring all at once makes things more complicated for me: various queries would need to include paused_or_retiring iff paused and retiring are not present, and exclude it if they are (otherwise we'd be double counting).
Thoughts about removing paused_or_retiring? I know this would be a breaking change in some sense, but it's also kinda a breaking change to have more non-exclusive states.
I think we don't have any strong preferences about keeping/removing the paused_or_retiring metric. Can you elaborate how adding those would be a breaking change?
I said "kinda". It messes up anything that assumes states other than "connected" are exclusive, or that (e.g.) a chart of all states other than "connected" would make sense.
Instead, one would need logic that includes paused_or_retiring or [paused and retiring] but not both... which isn't very straightforward in Prometheus (I'm still thinking about how to do this).
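To make the "one or the other, but not both" rule concrete, here is a minimal Python sketch of the logic applied to already-scraped samples. It is not the scheduler's exporter code and not a PromQL query; the function names and the sample dictionaries are hypothetical, and only the state names connected, paused, retiring, and paused_or_retiring come from this thread.

```python
# Hypothetical helpers illustrating the de-duplication rule discussed above.
# Input: a mapping of worker state name -> worker count from one scrape.

SPLIT_STATES = {"paused", "retiring"}
COMBINED_STATE = "paused_or_retiring"


def exclusive_states(samples: dict[str, float]) -> dict[str, float]:
    """Return the samples with overlapping states removed.

    If both split states are reported, drop the combined state so that
    workers are not counted twice. Otherwise keep the samples unchanged.
    """
    if SPLIT_STATES.issubset(samples):
        return {k: v for k, v in samples.items() if k != COMBINED_STATE}
    return dict(samples)


def total_not_connected(samples: dict[str, float]) -> float:
    """Sum every state except "connected", without double counting."""
    deduped = exclusive_states(samples)
    return sum(v for k, v in deduped.items() if k != "connected")


# Made-up sample data: an older scheduler exposing only the combined state,
# and a newer one exposing the split states alongside the combined one.
old_scrape = {"connected": 10, "paused_or_retiring": 4}
new_scrape = {"connected": 10, "paused": 1, "retiring": 3, "paused_or_retiring": 4}

print(total_not_connected(old_scrape))
print(total_not_connected(new_scrape))
```

Both scrapes yield the same total, which is the behavior a chart of all non-"connected" states would need while the three states coexist.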
Right, that makes sense. It's a shame we didn't introduce the split metrics earlier. For now, we're mostly interested in the paused metric because this may highlight a problematic cluster behavior (too small in size) while the retiring signal is a little bit of noise and is usually harmless.
So I don't have a solution for the "nice chart" problem but when using this as a tag, the retired metric as a standalone thing already makes sense.
Unit Test Results
See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.
29 files ±0  29 suites ±0  11h 52m 24s :stopwatch: - 2m 15s
4 087 tests ±0  3 970 :white_check_mark: +1  112 :zzz: ±0  4 :x: - 2  1 :fire: +1
55 287 runs +1  52 844 :white_check_mark: +4  2 438 :zzz: - 1  4 :x: - 3  1 :fire: +1
For more details on these failures and errors, see this check.
Results for commit dfc84cbd. ± Comparison against base commit d68a5d9c.
:recycle: This comment has been updated with latest results.
I'd be fine with removing the paused_or_retiring metric; my main concern was backward compatibility.
We mostly care about paused, as @fjetter said.
FWIW, I'm +1 for removing paused_or_retiring. We don't have a good story for backward compatibility yet, and this change seems like a net improvement.
@phofl: Should we move this forward by removing paused_or_retiring?
Yep, let's do that.