micrometer
Includes exemplars for count and sum aggregate metrics for histograms
Currently, exemplars are not included in the counter-like metrics recorded with histograms (e.g. foo_seconds_count and foo_seconds_sum). This change grabs the last non-null exemplar from any of the histogram buckets and applies it to those sampled values.
The lack of exemplars in these cases was surprising to our team. We use the http_server_requests_seconds_count metric to aggregate all Spring requests regardless of their duration, and we drive alerting from it. We feel there would be value in following exemplars to example traced requests at that granularity.
As for which metrics include exemplars or which exemplar to use, that is up for debate. I am treating _seconds_count and _seconds_sum as counters, since they feel like part of a counter, which does support exemplars in Prometheus, whereas _seconds_max is a gauge. I am also grabbing the last non-null exemplar, which may favor the larger buckets, but I wanted it to be predictable and not to require any additional iterations of the exemplars array.
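For readers following along, here is a minimal sketch of the selection rule described above, assuming the per-bucket exemplars are available as an array ordered from the lowest to the highest bucket. The `Exemplar` class and the `lastNonNull` helper are hypothetical stand-ins for illustration, not the Prometheus client or Micrometer types touched by this PR.

```java
class LastNonNullExemplar {

    // Hypothetical stand-in for an exemplar; the real client type differs.
    static final class Exemplar {
        final String traceId;
        final double value;

        Exemplar(String traceId, double value) {
            this.traceId = traceId;
            this.value = value;
        }
    }

    // Single forward pass over the buckets (lowest to highest). In an exporter
    // this would be the same loop that writes the bucket samples, so picking
    // the exemplar for _count and _sum needs no extra iteration.
    static Exemplar lastNonNull(Exemplar[] bucketExemplars) {
        Exemplar last = null;
        for (Exemplar exemplar : bucketExemplars) {
            if (exemplar != null) {
                last = exemplar; // later (higher) buckets overwrite earlier ones
            }
        }
        return last; // null when no bucket has an exemplar yet
    }

    public static void main(String[] args) {
        Exemplar[] buckets = {
            new Exemplar("trace-a", 0.01), null, new Exemplar("trace-b", 0.25), null
        };
        System.out.println(lastNonNull(buckets).traceId); // prints "trace-b"
    }
}
```

Note that "last" here means last in bucket order, not last in time, which is why this rule favors the larger buckets.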
There is a discussion about this on the Micrometer Slack; let me copy some interesting bits here, in separate comments.
This is missing right now because OpenMetrics does not support exemplars for Summary (neither count nor sum); I’m not sure why. It also seems Histogram only supports them on the buckets.
Other than this, I’m with you, I’m also missing this. E.g.: what if I have a Timer, the histogram is off, and I want to use an exemplar for the counter? I think this should be a common use case, but the spec does not support it.
Also, TBH, I’m not sure what the consequences of doing this would be :slightly_smiling_face: but:
- The Java client does not support exemplars at the API level for Summary
- The Java client supports exemplars at the API level for Histogram, but in the output only the buckets have exemplars; the count and sum do not, as defined in the spec.
I’m not sure that this implementation will work: it seems it will not give you the latest exemplar but the exemplar of the highest bucket. E.g., what happens if you switch the recording order in your tests, i.e. instead of this:
slos.record(10);
slos.record(250);
slos.record(1_000);
do this:
slos.record(1_000);
slos.record(250);
slos.record(10);
After you fix the buckets, I think the test will still be broken because the counter will get the exemplar of the first recording, since that one landed in the highest bucket. This behavior can result in the counter exemplar being initialized and never updated again.
Maybe it’s just me, but I think the behavior matters (assuming that OpenMetrics will support exemplars there) for two reasons:
- To me, getting the latest exemplar would be the expected behavior, but I’m not sure where I would get unexpected data or get into trouble with this assumption under the current implementation
- The first request can be significantly slower (which isn’t rare: lazy init, reading files, populating caches, GC getting excited because of all of this, etc.), so the exemplar of the count might never get updated again
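To make the concern concrete, here is a small self-contained simulation. It is only a sketch under stated assumptions: the bucket bounds (10, 250, 1000, +Inf), the `ExemplarOrderDemo` class, and the sequence counter used as a stand-in for a timestamp are all hypothetical, and each recording is assumed to store its exemplar only in the smallest bucket it fits into. It compares the "highest non-null bucket" rule with a "most recent recording" rule.

```java
class ExemplarOrderDemo {

    // Hypothetical exemplar: the recorded value plus a sequence number
    // standing in for the recording time.
    static final class Exemplar {
        final double value;
        final long seq;

        Exemplar(double value, long seq) {
            this.value = value;
            this.seq = seq;
        }
    }

    // Assumed bucket upper bounds, lowest to highest.
    private final double[] bounds = {10, 250, 1_000, Double.POSITIVE_INFINITY};
    private final Exemplar[] buckets = new Exemplar[bounds.length];
    private long seq = 0;

    // Stores the exemplar in the smallest bucket whose bound covers the value.
    void record(double value) {
        for (int i = 0; i < bounds.length; i++) {
            if (value <= bounds[i]) {
                buckets[i] = new Exemplar(value, ++seq);
                return;
            }
        }
    }

    // Rule discussed in this PR: the non-null exemplar of the highest bucket.
    Exemplar highestBucketExemplar() {
        Exemplar last = null;
        for (Exemplar e : buckets) {
            if (e != null) last = e;
        }
        return last;
    }

    // Alternative rule: the exemplar of the most recent recording.
    Exemplar mostRecentExemplar() {
        Exemplar latest = null;
        for (Exemplar e : buckets) {
            if (e != null && (latest == null || e.seq > latest.seq)) latest = e;
        }
        return latest;
    }

    public static void main(String[] args) {
        ExemplarOrderDemo slos = new ExemplarOrderDemo();
        slos.record(1_000); // slow first request
        slos.record(250);
        slos.record(10);
        for (int i = 0; i < 20; i++) slos.record(5); // many fast requests afterwards

        System.out.println("highest bucket: " + slos.highestBucketExemplar().value);
        System.out.println("most recent:    " + slos.mostRecentExemplar().value);
    }
}
```

In this simulation the highest-bucket rule keeps returning the exemplar of the very first (slow) recording no matter how many faster requests follow, while the most-recent rule tracks the latest one at the cost of comparing sequence numbers.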
@HaloFour asked questions to get feedback on the exemplars support of the OpenMetrics specs on CNCF slack: https://cloud-native.slack.com/archives/C01NP3BV26R/p1665087946929039
Sounds good, we can revisit once we get clarity about the OpenMetrics spec. If we required a "better" approach to determine which observation to use, would that require waiting on a solution from the Prometheus client?
I think this depends on what feedback we get back and how we want to implement it. I think it is likely that we don't need to wait for the Prometheus client, but let's see.
For the sake of documentation and transparency: it seems Prometheus will enable exemplars for all time series: https://groups.google.com/g/prometheus-developers/c/zgu5hwV_2oo/m/5VfUiOfmAgAJ
@jonatan-ivanov do you know if there is a GitHub issue open tracking the work on the Prometheus server side to accept exemplars on all time series?
fyi: https://github.com/prometheus/prometheus/issues/11982
Superseded by #3996