micrometer-docs
Interpretation: What does max measure in Timer and DistributionSummary
Based on the available documentation, it is not clear what the following Micrometer/Prometheus metric measures: http_server_requests_seconds_max
It measures how long the longest request took for a given uri tag in the last minute (I believe that is configurable).
If a given uri hasn't been called in that minute then you won't have a value.
@checketts Thanks for the answer. In Prometheus, what is the TYPE of this metric?
Off the top of my head it would be a gauge.
To make this issue actionable, @tarunrathor-pro, can you provide a link to the documentation that you found confusing/incomplete?
The problem is that the documentation on both the Spring and Micrometer sites does not describe this. It's just other sites that mention it, without explaining it well. Having some official documentation would help.
I want to implement a similar metric for Node.js using prom-client so that we can define this metric uniformly across Java and Node.js. I was curious to know the measurement, TYPE, and logic used to arrive at this in Micrometer so that I can do a similar implementation for Node.js.
At a high level, it is an addition to Timer and is implemented via TimeWindowMax.
@jkschneider would know the finer points, but my high-level understanding is that they use a ring buffer to hold onto just the last minute's values and report the max that occurred during that time.
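To make the ring-buffer idea concrete, here is a minimal, single-threaded Java sketch of a decaying maximum. It is not Micrometer's TimeWindowMax source: the class name TimeWindowMaxSketch, the injected millisecond clock, and the expiryMillis and bufferLength parameters are assumptions made for illustration. Every sample is folded into all buckets, and on each rotation the bucket being read is cleared and the next one takes over, so a sample stops influencing the reported max no later than one full window after it was recorded.

```java
import java.util.Arrays;
import java.util.function.LongSupplier;

/** Illustrative sketch only; names and structure are assumptions, not Micrometer's code. */
class TimeWindowMaxSketch {
    private final long[] ring;              // one running max per bucket
    private final long rotateEveryMillis;   // expiry divided by the number of buckets
    private final LongSupplier clockMillis; // injected clock so the sketch is testable
    private int current = 0;                // index of the bucket that poll() reads
    private long lastRotateMillis;

    TimeWindowMaxSketch(LongSupplier clockMillis, long expiryMillis, int bufferLength) {
        this.clockMillis = clockMillis;
        this.ring = new long[bufferLength];
        this.rotateEveryMillis = expiryMillis / bufferLength;
        this.lastRotateMillis = clockMillis.getAsLong();
    }

    /** Folds a sample into every bucket, so each bucket tracks the max since its last reset. */
    void record(long sample) {
        rotate();
        for (int i = 0; i < ring.length; i++) {
            ring[i] = Math.max(ring[i], sample);
        }
    }

    /** Returns the max held by the current (oldest) bucket. */
    long poll() {
        rotate();
        return ring[current];
    }

    private void rotate() {
        long now = clockMillis.getAsLong();
        long elapsed = now - lastRotateMillis;
        if (elapsed < rotateEveryMillis) {
            return;
        }
        if (elapsed >= rotateEveryMillis * ring.length) {
            // Nothing was rotated for a whole window: every bucket has expired.
            Arrays.fill(ring, 0L);
            current = 0;
            lastRotateMillis = now;
            return;
        }
        while (elapsed >= rotateEveryMillis) {
            ring[current] = 0L;                    // clear the bucket being retired
            current = (current + 1) % ring.length; // the next-oldest bucket takes over
            lastRotateMillis += rotateEveryMillis;
            elapsed -= rotateEveryMillis;
        }
    }
}
```

With a 60-second window split into 3 buckets, a sample of 500 recorded at t=0 and a sample of 300 recorded at t=25s give a reported max of 300 at t=65s (the 500 has aged out) and 0 once a whole window passes with no recordings. A prom-client port could follow the same shape and expose poll() through a Gauge.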
When looking at the Prometheus endpoint for my Micrometer app, it gives the TYPE for every metric like
# TYPE http_server_requests_seconds_max gauge
Did anyone figure out what the http_server_requests_seconds_max metric represents?
I've observed that it's not the maximum value as it's well below the average reported by the http_server_requests_seconds metric. The TimeWindowMax class describes it as a decaying maximum for a distribution based on a configurable ring buffer.
If anyone has any insights please let me know as I'm writing an article on these metrics. Happy to create a PR to include documentation too, if that helps.
Since this issue was opened, further clarification has been added to the documentation under the Timer section. See the NOTE in section "10. Timers" https://micrometer.io/docs/concepts#_timers
Max for basic Timer implementations such as CumulativeTimer, StepTimer is a time window max (TimeWindowMax). It means that its value is the maximum value during a time window. If no new values are recorded for the time window length, the max will be reset to 0 as a new time window starts. Time window size will be the step size of the meter registry unless expiry in DistributionStatisticConfig is set to other value explicitly. The reason why a time window max is used is to capture max latency in a subsequent interval after heavy resource pressure triggers the latency and prevents metrics from being published.
@tarunrathor-pro and @tkgregory could you take a look at that and let us know if you think additional clarification is warranted. If so, what is lacking or confusing?
@shakuzen that's a lot clearer now. I obviously missed these docs in my search so thanks for pointing me in the right direction.
Although the note exists in the documentation today, a small section more prominently explaining the max and specifically TimeWindowMax is probably warranted. I'm moving this issue to the docs repo, which is where changes would be made. We'll have to figure out how to best work it into the documentation (suggestions welcome).
I'd just like to second this request. We were trying to determine exactly what the _max metric generated by a Timer represented, and what time period it was calculated over, and couldn't find the answer anywhere until we found this issue.
Although one of the first places I looked was the Micrometer documentation section on Timers that you referred me to above, it wasn't at all clear to me that the NOTE referencing TimeWindowMax, which the replies above point to, was (in part) an answer to our questions.
I think it should be much more explicitly documented that for any Timer, separate metrics with the _count, _sum, and _max suffixes (for Prometheus, at least) will be generated. It should also state what the default time window is over which the _max value is calculated (somebody above said 1 minute?), and how to change that time window if we wish to do so.
Time window size will be the step size of the meter registry unless expiry in DistributionStatisticConfig is set to other value explicitly
Prometheus requires the bucket values to be accumulative and never decay, so I must set the expiry to null for Prometheus. However this _max is a gauge calculated on a time window, and it must expire and can go down, there must be an expiry for it. So, does it mean the _max will never go down for the Prometheus?
Is this expiry configuration only for "push-based" endpoints, not for "pull-based" like Prometheus? If so, how is the _max gauge calculated when it is scraped from Prometheus?
Prometheus requires the bucket values to be accumulative and never decay
Buckets in a Prometheus Histogram are Counters, so they are monotonic (they never decay). But Micrometer's Max is a Gauge, not a Histogram, so monotonicity does not apply to it; a Gauge can go up or down freely.
so I must set the expiry to null for Prometheus.
I don't think so, see above. Also, a never expiring max is not very useful in general.
So, does it mean the _max will never go down for the Prometheus?
No, max will decay, see above.
Is this expiry configuration only for "push-based" endpoints, not for "pull-based" like Prometheus? If so, how is the _max gauge calculated when it is scraped from Prometheus?
This applies to most of the registries, including Prometheus; see TimeWindowMax.
Max for basic Timer implementations such as CumulativeTimer, StepTimer is a time window max (TimeWindowMax). It means that its value is the maximum value during a time window. If no new values are recorded for the time window length, the max will be reset to 0 as a new time window starts. Time window size will be the step size of the meter registry unless expiry in DistributionStatisticConfig is set to other value explicitly.
Let's say I have a Timer
Timer.builder("chassis.redis1.latency") // Note, "." will be converted to "_" for Prometheus format
.maximumExpectedValue(Duration.ofMillis(1000))
.publishPercentileHistogram()
.register(meterRegistry).record(someDuration, TimeUnit.MILLISECONDS);
It typically generates a max gauge and some counters (_sum, _count and the buckets), e.g.:
# HELP chassis_redis1_latency_seconds_max
# TYPE chassis_redis1_latency_seconds_max gauge
chassis_redis1_latency_seconds_max{app="demo",node="001",service_name="demo",} 1.133
# HELP chassis_redis1_latency_seconds
# TYPE chassis_redis1_latency_seconds histogram
chassis_redis1_latency_seconds_count{app="demo",node="001",service_name="demo",} 200.0
chassis_redis1_latency_seconds_sum{app="demo",node="001",service_name="demo",} 204.375
chassis_redis1_latency_seconds_bucket{app="demo",node="001",service_name="demo",le="0.001",} 0.0
chassis_redis1_latency_seconds_bucket{app="demo",node="001",service_name="demo",le="0.001048576",} 0.0
chassis_redis1_latency_seconds_bucket{app="demo",node="001",service_name="demo",le="0.001398101",} 0.0
...
From the doc, "Time window size will be the step size of the meter registry unless expiry in DistributionStatisticConfig is set to other value explicitly", it looks like the expiry controls both the max gauge and the counters. I need to set the expiry for the max gauge as you said it is useless if no decay, but it will make counters decay as well, which is a MUST-NOT for Prometheus. There should be two independent values expiryMaxGauge and expiryCounters to control the decay, or we simply ignore the expiry on counters when PrometheusMeterRegistry is used.
What is the expected behavior for Prometheus here?
it looks like the expiry controls both the max gauge and the counters
I don't think it does. The doc in your quote mentions the "step" size, it is talking about "step" registries. Prometheus is "cumulative" so the part you quoted should not apply to it. Maybe we should somehow call this out in the docs. Please let us know if you have any suggestions.
I need to set the expiry for the max gauge as you said it is useless if no decay
You don't need to set anything to make it decay, max decays by default. Leaving buffer length and expiry on their defaults is also the recommended way.
but it will make counters decay as well
Did you try this? Prometheus has its own histogram (PrometheusHistogram) where we do this: .expiry(Duration.ofDays(1825)) // effectively never rolls over. So if you see decaying behavior on counters that must be a bug. In that case, could you please open a new issue with a minimal Java sample that reproduces the issue so that we can troubleshoot?
What is the expected behavior for Prometheus here?
Max (gauge) should decay, counters should not, please see the previous section and let us know if you can reproduce a behavior where this is not the case.
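The expected behavior stated above (max decays, counters do not) can be illustrated with a small stand-alone Java sketch. It does not use Micrometer: TimerSketch, its fields, and the single-bucket expiry are hypothetical simplifications. count and totalMillis only ever grow, like _count and _sum on a cumulative registry, while max falls back to 0 when the window rolls over.

```java
import java.util.function.LongSupplier;

/** Illustrative sketch only; not Micrometer's Timer implementation. */
class TimerSketch {
    private final LongSupplier clockMillis; // injected clock so the sketch is testable
    private final long expiryMillis;        // length of the max window (assumed parameter)
    private long count;                     // monotonic, plays the role of _count
    private long totalMillis;               // monotonic, plays the role of _sum
    private long max;                       // decaying, plays the role of _max
    private long windowStartMillis;

    TimerSketch(LongSupplier clockMillis, long expiryMillis) {
        this.clockMillis = clockMillis;
        this.expiryMillis = expiryMillis;
        this.windowStartMillis = clockMillis.getAsLong();
    }

    void record(long durationMillis) {
        expireMaxIfNeeded();
        count++;                            // never reset
        totalMillis += durationMillis;      // never reset
        max = Math.max(max, durationMillis);
    }

    long count() { return count; }

    long totalMillis() { return totalMillis; }

    long max() {
        expireMaxIfNeeded();
        return max;
    }

    /** Only the max is subject to expiry; the counters are left untouched. */
    private void expireMaxIfNeeded() {
        long now = clockMillis.getAsLong();
        if (now - windowStartMillis >= expiryMillis) {
            max = 0L;
            windowStartMillis = now;
        }
    }
}
```

Recording 1133 ms and 200 ms and then reading the values two minutes later with a one-minute expiry leaves count at 2 and totalMillis at 1333, while max has returned to 0.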
Did you try this? Prometheus has its own histogram (PrometheusHistogram) where we do this: .expiry(Duration.ofDays(1825)) // effectively never rolls over. So if you see decaying behavior on counters that must be a bug. In that case, could you please open a new issue with a minimal Java sample that reproduces the issue so that we can troubleshoot?
Not tested, but I read the source code and came to the same conclusion. That's why I was confused: the Prometheus implementation does not behave the way the docs say.
The doc in your quote mentions the "step" size, it is talking about "step" registries. Prometheus is "cumulative" so the part you quoted should not apply to it. Maybe we should somehow call this out in the docs. Please let us know if you have any suggestions.
So, a Timer generates a gauge and many counters:
1. the gauge _max will respect the expiry parameter and always has a decay time window
2. the counters _sum, _count and the buckets depend on the registry:
2.1 they do not respect expiry with cumulative collectors like PrometheusMeterRegistry (pull-based).
2.2 they respect expiry with step-based collectors like StepMeterRegistry (push-based).
I really appreciate the clarification. I hope we can change the doc just a little bit; when I read the doc I was confused about what the "step" is in Prometheus. If my comment above is correct, do you mind if I create a PR for the doc?
Please do not guess based on the code but test it. :)
What you wrote above seems right, with one addition: there are registries that do not support histograms at all, and there are registries that support both cumulative and delta.
If you have suggestions for the docs, please feel free to file a PR.
I have created a PR, pls review: https://github.com/micrometer-metrics/micrometer-docs/pull/333 @jonatan-ivanov
Closed by #333