awx
awx copied to clipboard
Job state metrics: split into gauge and counter
Signed-off-by: onefourfive <>
SUMMARY
Closes #14369
Prometheus counter
metric types should be used for metrics that can only increase, eg terminal jobs states like failed
, canceled
, error
, and successful
.
ISSUE TYPE
- New or Enhanced Feature
COMPONENT NAME
Metrics
- API
- Other
AWX VERSION
awx: 22.5.1.dev29+gbd7ad057c4
ADDITIONAL INFORMATION
Metrics Before
# HELP awx_status_total Status of Job launched
# TYPE awx_status_total gauge
awx_status_total{status="running"} 1.0
awx_status_total{status="canceled"} 33.0
awx_status_total{status="waiting"} 0.0
awx_status_total{status="successful"} 65753.0
awx_status_total{status="error"} 359.0
awx_status_total{status="failed"} 985.0
awx_status_total{status="pending"} 0.0
Metrics After
# HELP awx_status_started Status of Job started
# TYPE awx_status_started gauge
awx_status_started{status="waiting"} 0.0
awx_status_started{status="pending"} 0.0
awx_status_started{status="running"} 0.0
# HELP awx_status_completed_total Status of Jobs completed
# TYPE awx_status_completed_total counter
awx_status_completed_total{status="canceled"} 0.0
awx_status_completed_total{status="failed"} 0.0
awx_status_completed_total{status="error"} 0.0
awx_status_completed_total{status="successful"} 0.0
# TYPE awx_status_completed_created gauge
awx_status_completed_created{status="canceled"} 1.693263938630845e+09
awx_status_completed_created{status="failed"} 1.6932639386309366e+09
awx_status_completed_created{status="error"} 1.6932639386309566e+09
awx_status_completed_created{status="successful"} 1.6932639386309748e+09
Note the _created
metrics are automatically exported by Prometheus for counters. This can be disabled with an environment variable PROMETHEUS_DISABLE_CREATED_SERIES=True
(see docs.
Thank you for opening this PR. Our team will review it shortly and let you know if there are any changes needed.
awx_status_completed_created{status="canceled"} 1.693263938630845e+09
These numbers don't look right, or is there something I'm not understanding?
awx_status_completed_created{status="canceled"} 1.693263938630845e+09
These numbers don't look right, or is there something I'm not understanding?
This is a Unix time stamp showing when the metric was created— checking this value myself, this comes out to a date of this past Monday which makes sense.
Oh, I'm showing my lack of understanding of this system then.
so the metrics awx_status_completed_total{status="canceled"} 0.0
and awx_status_completed_created{status="canceled"} 1.693263938630845e+09
are complementary, is that fair to say? Since this is a rolling count, we need a time window over which it applies, which is all times after the _created.
This is certainly cool. But this segways into the obvious question here - when you run cleanup_jobs system management job, it will reduce the total counts. After that happens, I could see the data collector freaking out.
Is there anything in prometheus_client that allows us to add aliases for the old names?
Good question @gundalow ! Not that I am aware of, and none that I see in their documentation.
Since the underlying data driving these metrics hasn't been changed, it would still be possible to keep the original gauges if that's deemed to be needed. However that seems extraneous since it would be the same data.
This is certainly cool. But this segways into the obvious question here - when you run cleanup_jobs system management job, it will reduce the total counts. After that happens, I could see the data collector freaking out.
This is a good point @AlanCoding , while writing this I couldn't conceive of a situation where the counter awx_status_completed
might be larger than the underlying statuses.get(state)
. This is a worthy case to consider.
Also, what exactly is the timestamp for the *_created
metrics? Is it from startup? Does the count represent the number of jobs that went into that status from startup, or the total number in the database?
Hello, We took another look at this and saw that a few of the CI checks are failing. This particular test seems to be causing the failure:
_____________________________ test_metrics_counts ______________________________
[gw1] linux -- Python 3.9.17 /var/lib/awx/venv/awx/bin/python3.9
Traceback (most recent call last):
File "/awx_devel/awx/main/tests/functional/analytics/test_metrics.py", line 54, in test_metrics_counts
assert EXPECTED_VALUES[name] == value
KeyError: 'awx_status_completed_created'
If you could go ahead and get that test updated we would be happy to further investigate this. Thank you again for your time!