
Job state metrics: split into gauge and counter

Open onefourfive opened this issue 1 year ago • 7 comments

Signed-off-by: onefourfive <>

SUMMARY

Closes #14369

The Prometheus counter metric type should be used for metrics that can only increase, e.g. terminal job states such as failed, canceled, error, and successful.
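As a minimal sketch of the split described above (not the PR's actual implementation), the status counts can be partitioned into states a job can still leave and terminal states. The helper name `split_status_counts` is hypothetical; the sample counts are taken from the "Metrics Before" output in this PR.

```python
# Terminal states named in this PR; their counts are expected to only grow.
TERMINAL_STATES = {"failed", "canceled", "error", "successful"}


def split_status_counts(statuses):
    """Split a {status: count} mapping into (gauge-like, counter-like) parts.

    Hypothetical helper for illustration only.
    """
    started = {s: n for s, n in statuses.items() if s not in TERMINAL_STATES}
    completed = {s: n for s, n in statuses.items() if s in TERMINAL_STATES}
    return started, completed


# Counts copied from the "Metrics Before" sample in this PR.
statuses = {
    "running": 1,
    "canceled": 33,
    "waiting": 0,
    "successful": 65753,
    "error": 359,
    "failed": 985,
    "pending": 0,
}

started, completed = split_status_counts(statuses)
print(started)    # states that can go up and down (gauge)
print(completed)  # terminal states (counter)
```

The first mapping corresponds to what would be exposed as a gauge, the second to what would be exposed as a counter.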

ISSUE TYPE
  • New or Enhanced Feature
COMPONENT NAME

Metrics

  • API
  • Other
AWX VERSION
awx: 22.5.1.dev29+gbd7ad057c4
ADDITIONAL INFORMATION

Metrics Before

# HELP awx_status_total Status of Job launched
# TYPE awx_status_total gauge
awx_status_total{status="running"} 1.0
awx_status_total{status="canceled"} 33.0
awx_status_total{status="waiting"} 0.0
awx_status_total{status="successful"} 65753.0
awx_status_total{status="error"} 359.0
awx_status_total{status="failed"} 985.0
awx_status_total{status="pending"} 0.0

Metrics After

# HELP awx_status_started Status of Job started
# TYPE awx_status_started gauge
awx_status_started{status="waiting"} 0.0
awx_status_started{status="pending"} 0.0
awx_status_started{status="running"} 0.0
# HELP awx_status_completed_total Status of Jobs completed
# TYPE awx_status_completed_total counter
awx_status_completed_total{status="canceled"} 0.0
awx_status_completed_total{status="failed"} 0.0
awx_status_completed_total{status="error"} 0.0
awx_status_completed_total{status="successful"} 0.0
# TYPE awx_status_completed_created gauge
awx_status_completed_created{status="canceled"} 1.693263938630845e+09
awx_status_completed_created{status="failed"} 1.6932639386309366e+09
awx_status_completed_created{status="error"} 1.6932639386309566e+09
awx_status_completed_created{status="successful"} 1.6932639386309748e+09

Note that the _created metrics are exported automatically by the Prometheus client library for counters. This can be disabled by setting the environment variable PROMETHEUS_DISABLE_CREATED_SERIES=True (see docs).

onefourfive avatar Aug 28 '23 23:08 onefourfive

Thank you for opening this PR. Our team will review it shortly and let you know if there are any changes needed.

jessicamack avatar Aug 30 '23 19:08 jessicamack

awx_status_completed_created{status="canceled"} 1.693263938630845e+09

These numbers don't look right, or is there something I'm not understanding?

AlanCoding avatar Aug 31 '23 15:08 AlanCoding

awx_status_completed_created{status="canceled"} 1.693263938630845e+09

These numbers don't look right, or is there something I'm not understanding?

This is a Unix timestamp showing when the metric was created. Checking this value myself, it comes out to a date of this past Monday, which makes sense.
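For anyone who wants to check the value, a quick conversion with the Python standard library (UTC assumed) looks like this:

```python
from datetime import datetime, timezone

# Value of awx_status_completed_created{status="canceled"} from the output above.
created = 1.693263938630845e+09

# Interpret the sample as seconds since the Unix epoch, in UTC.
dt = datetime.fromtimestamp(created, tz=timezone.utc)

print(dt.isoformat())
print(dt.weekday())  # 0 means Monday
```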

onefourfive avatar Aug 31 '23 15:08 onefourfive

Oh, I'm showing my lack of understanding of this system then.

So the metrics awx_status_completed_total{status="canceled"} 0.0 and awx_status_completed_created{status="canceled"} 1.693263938630845e+09 are complementary, is that fair to say? Since this is a rolling count, we need a time window over which it applies, which is all times after the _created timestamp.

This is certainly cool. But this segues into the obvious question here: when you run the cleanup_jobs system management job, it will reduce the total counts. After that happens, I could see the data collector freaking out.
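To make the concern concrete, here is a rough sketch of how a drop in a counter is interpreted. Prometheus treats any decrease between consecutive counter samples as a reset to zero; the function below is a simplified stand-in for that behaviour (it ignores the extrapolation that the real increase() performs), and the scrape values are hypothetical.

```python
def increase_with_resets(samples):
    """Total increase across counter samples, treating any drop as a reset to zero.

    Simplified illustration only; not Prometheus's actual implementation.
    """
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        if cur < prev:
            # Counter went down: assume it restarted from 0 and climbed to `cur`.
            total += cur
        else:
            total += cur - prev
    return total


# Hypothetical scrapes of a "failed" count: 5 new failures, then a cleanup
# removes old rows (990 -> 400), then 5 more failures.
scrapes = [985, 990, 400, 405]

print(increase_with_resets(scrapes))  # 410.0, although only 10 jobs actually failed
```

If the exported value follows the database row count, a cleanup would show up as a large spurious increase rather than as a decrease.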

AlanCoding avatar Aug 31 '23 16:08 AlanCoding

Is there anything in prometheus_client that allows us to add aliases for the old names?

Good question, @gundalow! Not that I am aware of, and I see none in their documentation.

Since the underlying data driving these metrics hasn't been changed, it would still be possible to keep the original gauges if that's deemed to be needed. However that seems extraneous since it would be the same data.

This is certainly cool. But this segues into the obvious question here: when you run the cleanup_jobs system management job, it will reduce the total counts. After that happens, I could see the data collector freaking out.

This is a good point, @AlanCoding. While writing this I couldn't conceive of a situation where the counter awx_status_completed might be larger than the underlying statuses.get(state). This is a worthy case to consider.

onefourfive avatar Aug 31 '23 21:08 onefourfive

Also, what exactly is the timestamp for the *_created metrics? Is it from startup? Does the count represent the number of jobs that went into that status from startup, or the total number in the database?

AlanCoding avatar Sep 01 '23 19:09 AlanCoding

Hello, we took another look at this and saw that a few of the CI checks are failing. This particular test seems to be causing the failure:

_____________________________ test_metrics_counts ______________________________
[gw1] linux -- Python 3.9.17 /var/lib/awx/venv/awx/bin/python3.9
Traceback (most recent call last):
  File "/awx_devel/awx/main/tests/functional/analytics/test_metrics.py", line 54, in test_metrics_counts
    assert EXPECTED_VALUES[name] == value
KeyError: 'awx_status_completed_created' 

If you could go ahead and get that test updated we would be happy to further investigate this. Thank you again for your time!
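One possible direction, sketched here under assumptions rather than taken from the actual test file: skip the auto-generated _created samples when comparing against EXPECTED_VALUES. The parsing helper, the `check_counts` function, and the contents of EXPECTED_VALUES below are hypothetical; only the assertion itself mirrors the traceback. Adding the new key to the expected values, or disabling the created series, would be alternatives.

```python
# Hypothetical subset of expected values, for illustration only.
EXPECTED_VALUES = {"awx_status_completed_total": 0.0}

# Sample lines copied from the "Metrics After" output in this PR.
SAMPLE_OUTPUT = """\
# TYPE awx_status_completed_total counter
awx_status_completed_total{status="canceled"} 0.0
awx_status_completed_total{status="failed"} 0.0
# TYPE awx_status_completed_created gauge
awx_status_completed_created{status="canceled"} 1.693263938630845e+09
awx_status_completed_created{status="failed"} 1.6932639386309366e+09
"""


def parse_samples(text):
    """Yield (metric_name, value) pairs from simple text exposition lines."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name_and_labels, value = line.rsplit(" ", 1)
        yield name_and_labels.split("{", 1)[0], float(value)


def check_counts(text, expected):
    """Compare samples with expected values, ignoring *_created series."""
    checked = 0
    for name, value in parse_samples(text):
        if name.endswith("_created"):
            continue  # creation timestamps vary from run to run
        assert expected[name] == value
        checked += 1
    return checked


print(check_counts(SAMPLE_OUTPUT, EXPECTED_VALUES))
```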

djyasin avatar Sep 27 '23 19:09 djyasin