Interpretation: What does max measure in Timer and DistributionSummary

Open tarunrathor-pro opened this issue 5 years ago • 13 comments

Based on the available documentation, it is not clear what the following Micrometer/Prometheus metric measures: http_server_requests_seconds_max

tarunrathor-pro avatar Apr 01 '19 15:04 tarunrathor-pro

It measures how long the longest request took for a given uri tag in the last minute (I believe that is configurable).

If a given uri hasn't been called in that minute then you won't have a value.

checketts avatar Apr 01 '19 15:04 checketts

@checketts Thanks for the answer. In Prometheus, what is the TYPE of this metric?

tarunrathor-pro avatar Apr 01 '19 15:04 tarunrathor-pro

Off the top of my head it would be a gauge.

checketts avatar Apr 01 '19 16:04 checketts

To make this issue actionable @tarunrathor-pro Can you provide a link to the documentation that you found confusing/incomplete?

checketts avatar Apr 01 '19 16:04 checketts

The problem is that the documentation on both the Spring and Micrometer sites does not describe this. Only other sites mention it, and they do not explain it well. Having some official documentation would help.

tarunrathor-pro avatar Apr 01 '19 16:04 tarunrathor-pro

I want to implement a similar metric for Node.js using prom-client so that we can define this metric uniformly across Java and Node.js. I was curious to know the measurement, the TYPE, and the logic used to arrive at this value in Micrometer so that I can do a similar implementation for Node.js.

tarunrathor-pro avatar Apr 01 '19 16:04 tarunrathor-pro

At a high level, it is an addition to Timer and is implemented via TimeWindowMax.

@jkschneider would know the finer points, but my high-level understanding is that they use a ring buffer to hold onto just the last minute's values and report the max that occurred during that time.
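
The ring-buffer idea described above can be sketched in plain Java. This is only an illustration under assumptions, not Micrometer's actual TimeWindowMax: the class name TimeWindowMaxSketch, the injected millisecond clock, and the parameters rotateEveryMillis and bufferLength are all made up for the example.

```java
import java.util.function.LongSupplier;

/** Illustrative decaying max over a ring buffer; NOT Micrometer's TimeWindowMax. */
class TimeWindowMaxSketch {
    private final LongSupplier clockMillis; // injected clock (assumption, eases testing)
    private final long rotateEveryMillis;   // hypothetical rotation interval
    private final double[] ring;            // each slot covers a different span of time
    private int current = 0;                // slot that poll() reads
    private long lastRotateMillis;

    TimeWindowMaxSketch(LongSupplier clockMillis, long rotateEveryMillis, int bufferLength) {
        this.clockMillis = clockMillis;
        this.rotateEveryMillis = rotateEveryMillis;
        this.ring = new double[bufferLength];
        this.lastRotateMillis = clockMillis.getAsLong();
    }

    /** Records a sample into every slot, so each slot tracks the max since its last reset. */
    synchronized void record(double sample) {
        rotate();
        for (int i = 0; i < ring.length; i++) {
            ring[i] = Math.max(ring[i], sample);
        }
    }

    /** Returns the max seen by the current slot, or 0 if nothing was recorded recently. */
    synchronized double poll() {
        rotate();
        return ring[current];
    }

    /** Resets the current slot and advances once per elapsed rotation interval. */
    private void rotate() {
        long elapsed = clockMillis.getAsLong() - lastRotateMillis;
        while (elapsed >= rotateEveryMillis) {
            ring[current] = 0.0;
            current = (current + 1) % ring.length;
            elapsed -= rotateEveryMillis;
            lastRotateMillis += rotateEveryMillis;
        }
    }
}
```

In this sketch a sample stays visible for bufferLength rotations and then drops out, which is why such a gauge can go down again and can even sit below a long-running average once an old spike has aged out of the window.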

checketts avatar Apr 01 '19 16:04 checketts

When looking at the Prometheus endpoint for my Micrometer app, it gives the TYPE for every metric like

# TYPE http_server_requests_seconds_max gauge

shakuzen avatar Apr 02 '19 02:04 shakuzen

Did anyone figure out what the http_server_requests_seconds_max metric represents?

I've observed that it's not the maximum value as it's well below the average reported by the http_server_requests_seconds metric. The TimeWindowMax class describes it as a decaying maximum for a distribution based on a configurable ring buffer.

If anyone has any insights please let me know as I'm writing an article on these metrics. Happy to create a PR to include documentation too, if that helps.

tkgregory avatar May 21 '20 06:05 tkgregory

Since this issue was opened, further clarification has been added to the documentation under the Timer section. See the NOTE in section "10. Timers" https://micrometer.io/docs/concepts#_timers

Max for basic Timer implementations such as CumulativeTimer, StepTimer is a time window max (TimeWindowMax). It means that its value is the maximum value during a time window. If no new values are recorded for the time window length, the max will be reset to 0 as a new time window starts. Time window size will be the step size of the meter registry unless expiry in DistributionStatisticConfig is set to other value explicitly. The reason why a time window max is used is to capture max latency in a subsequent interval after heavy resource pressure triggers the latency and prevents metrics from being published.
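
As a rough illustration of the expiry setting mentioned in the note, the sketch below overrides the distribution statistic settings on a single Timer. The meter name demo.latency, the class name, and the chosen values are placeholders, and SimpleMeterRegistry is used only to keep the example self-contained. How expiry and buffer length combine into the effective max window is best verified against the Micrometer version in use.

```java
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

import java.time.Duration;
import java.util.concurrent.TimeUnit;

class MaxWindowConfigSketch {
    public static void main(String[] args) {
        // Any registry would do; SimpleMeterRegistry keeps the sketch dependency-light.
        MeterRegistry registry = new SimpleMeterRegistry();

        Timer timer = Timer.builder("demo.latency") // placeholder meter name
                // Overrides the default expiry used for distribution statistics such as max.
                .distributionStatisticExpiry(Duration.ofMinutes(5))
                // Number of ring buffer slots; the value here is illustrative only.
                .distributionStatisticBufferLength(5)
                .register(registry);

        timer.record(Duration.ofMillis(250));

        // Reads the current time-window max in the requested unit.
        System.out.println(timer.max(TimeUnit.MILLISECONDS));
    }
}
```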

@tarunrathor-pro and @tkgregory could you take a look at that and let us know if you think additional clarification is warranted. If so, what is lacking or confusing?

shakuzen avatar May 21 '20 07:05 shakuzen

@shakuzen that's a lot clearer now. I obviously missed these docs in my search so thanks for pointing me in the right direction.

tkgregory avatar May 23 '20 16:05 tkgregory

Although the note exists in the documentation today, a small section more prominently explaining the max and specifically TimeWindowMax is probably warranted. I'm moving this issue to the docs repo, which is where changes would be made. We'll have to figure out how to best work it into the documentation (suggestions welcome).

shakuzen avatar May 28 '20 03:05 shakuzen

I'd just like to second this request. We were trying to determine exactly what the _max metric generated by a Timer represented, and what time period it was calculated over, and couldn't find the answer anywhere until we found this issue.

Although one of the first places I looked was the Micrometer documentation section on Timers that you referred to above, it wasn't at all clear to me that the NOTE referencing TimeWindowMax, which the replies above point to, was (in part) an answer to our questions.

I think it should be much more explicitly documented that for any Timer, separate metrics with the _count, _sum, and _max suffixes (for Prometheus, at least) will be generated. The docs should also state the default time window over which the _max value is calculated (somebody above said 1 minute?) and how to change that time window, if we wish to do so.

aprevost avatar Mar 18 '22 21:03 aprevost

Time window size will be the step size of the meter registry unless expiry in DistributionStatisticConfig is set to other value explicitly

Prometheus requires the bucket values to be accumulative and never decay, so I must set the expiry to null for Prometheus. However, this _max is a gauge calculated over a time window; it must expire and can go down, so there must be an expiry for it. So, does it mean the _max will never go down for the Prometheus?

Is this expiry configuration only for "push-based" endpoints, not for "pull-based" like Prometheus? If so, how is the _max gauge calculated when it is scraped from Prometheus?

DanielYWoo avatar Sep 30 '23 15:09 DanielYWoo

Prometheus requires the bucket values to be accumulative and never decay

Buckets in a Prometheus Histogram are Counters, so they are monotonic (never decay). But Micrometer's Max is a Gauge, not a Histogram, so monotonicity does not apply to it; a Gauge can go up or down freely.

so I must set the expiry to null for Prometheus.

I don't think so, see above. Also, a never expiring max is not very useful in general.

So, does it mean the _max will never go down for the Prometheus?

No, max will decay, see above.

Is this expiry configuration only for "push-based" endpoints, not for "pull-based" like Prometheus? If so, how is the _max gauge calculated when it is scraped from Prometheus?

This applies to most of the registries, including Prometheus; see TimeWindowMax.

jonatan-ivanov avatar Oct 02 '23 00:10 jonatan-ivanov

Max for basic Timer implementations such as CumulativeTimer, StepTimer is a time window max (TimeWindowMax). It means that its value is the maximum value during a time window. If no new values are recorded for the time window length, the max will be reset to 0 as a new time window starts. Time window size will be the step size of the meter registry unless expiry in DistributionStatisticConfig is set to other value explicitly.

Let's say I have a Timer

Timer.builder("chassis.redis1.latency") // Note, "." will be converted to "_" for Prometheus format
     .maximumExpectedValue(Duration.ofMillis(1000))
     .publishPercentileHistogram()
     .register(meterRegistry).record(someDuration, TimeUnit.MILLISECONDS);

It typically generates a max gauge and some counters (_sum, _count and the buckets), e.g.,

# HELP chassis_redis1_latency_seconds_max
# TYPE chassis_redis1_latency_seconds_max gauge
chassis_redis1_latency_seconds_max{app="demo",node="001",service_name="demo",} 1.133
# HELP chassis_redis1_latency_seconds
# TYPE chassis_redis1_latency_seconds histogram
chassis_redis1_latency_seconds_count{app="demo",node="001",service_name="demo",} 200.0
chassis_redis1_latency_seconds_sum{app="demo",node="001",service_name="demo",} 204.375
chassis_redis1_latency_seconds_bucket{app="demo",node="001",service_name="demo",le="0.001",} 0.0
chassis_redis1_latency_seconds_bucket{app="demo",node="001",service_name="demo",le="0.001048576",} 0.0
chassis_redis1_latency_seconds_bucket{app="demo",node="001",service_name="demo",le="0.001398101",} 0.0
...

From the doc ("Time window size will be the step size of the meter registry unless expiry in DistributionStatisticConfig is set to other value explicitly"), it looks like the expiry controls both the max gauge and the counters. I need to set the expiry for the max gauge as you said it is useless if no decay, but it will make counters decay as well, which is a MUST-NOT for Prometheus. There should be two independent values, expiryMaxGauge and expiryCounters, to control the decay, or we could simply ignore the expiry on counters when PrometheusMeterRegistry is used.

What is the expected behavior for Prometheus here?

DanielYWoo avatar Oct 02 '23 02:10 DanielYWoo

it looks like the expiry controls both the max gauge and the counters

I don't think it does. The doc in your quote mentions the "step" size, it is talking about "step" registries. Prometheus is "cumulative" so the part you quoted should not apply to it. Maybe we should somehow call this out in the docs. Please let us know if you have any suggestions.

I need to set the expiry for the max gauge as you said it is useless if no decay

You don't need to set anything to make it decay, max decays by default. Leaving buffer length and expiry on their defaults is also the recommended way.

but it will make counters decay as well

Did you try this? Prometheus has its own histogram (PrometheusHistogram) where we do this: .expiry(Duration.ofDays(1825)) // effectively never rolls over. So if you see decaying behavior on counters that must be a bug. In that case, could you please open a new issue with a minimal Java sample that reproduces the issue so that we can troubleshoot?

What is the expected behavior for Prometheus here?

Max (gauge) should decay, counters should not, please see the previous section and let us know if you can reproduce a behavior where this is not the case.

jonatan-ivanov avatar Oct 02 '23 04:10 jonatan-ivanov

Did you try this? Prometheus has its own histogram (PrometheusHistogram) where we do this: .expiry(Duration.ofDays(1825)) // effectively never rolls over. So if you see decaying behavior on counters that must be a bug. In that case, could you please open a new issue with a minimal Java sample that reproduces the issue so that we can troubleshoot?

I have not tested it, but I read the source code and came to the same conclusion. That is why I was confused: the Prometheus implementation does not behave the way the docs describe.

The doc in your quote mentions the "step" size, it is talking about "step" registries. Prometheus is "cumulative" so the part you quoted should not apply to it. Maybe we should somehow call this out in the docs. Please let us know if you have any suggestions.

So, a Timer generates a gauge and many counters:

  1. the gauge _max respects the expiry parameter and always has a decay time window
  2. the counters (_sum, _count and the buckets) depend on the registry type:
     2.1 they do not respect expiry with cumulative collectors like PrometheusMeterRegistry (pull-based).
     2.2 they respect expiry with step-based collectors like StepMeterRegistry (push-based).
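
The cumulative versus step distinction in point 2 can be illustrated with a small plain-Java sketch. It is a simplified model under assumptions, not Micrometer's counter implementation: the class name CumulativeVsStepSketch, the injected clock, and the stepMillis parameter are hypothetical.

```java
import java.util.function.LongSupplier;

/** Contrasts a never-resetting cumulative count with a per-step count. Illustrative only. */
class CumulativeVsStepSketch {
    private final LongSupplier clockMillis; // injected clock (assumption, eases testing)
    private final long stepMillis;          // hypothetical step (publish interval) size
    private long total;                     // cumulative view: only ever grows
    private long currentStepCount;          // increments seen in the in-progress step
    private long previousStepCount;         // value reported for the last completed step
    private long currentStepStart;

    CumulativeVsStepSketch(LongSupplier clockMillis, long stepMillis) {
        this.clockMillis = clockMillis;
        this.stepMillis = stepMillis;
        this.currentStepStart = clockMillis.getAsLong();
    }

    synchronized void increment() {
        roll();
        total++;
        currentStepCount++;
    }

    /** What a cumulative (pull-style) backend would read: monotonic, never decays. */
    synchronized long cumulativeCount() {
        return total;
    }

    /** What a step (push-style) backend would read: the count of the last completed step. */
    synchronized long stepCount() {
        roll();
        return previousStepCount;
    }

    /** Closes out the in-progress step once at least one full step has elapsed. */
    private void roll() {
        long stepsElapsed = (clockMillis.getAsLong() - currentStepStart) / stepMillis;
        if (stepsElapsed >= 1) {
            // If several steps passed, the most recently completed one saw no increments.
            previousStepCount = (stepsElapsed == 1) ? currentStepCount : 0;
            currentStepCount = 0;
            currentStepStart += stepsElapsed * stepMillis;
        }
    }
}
```

In this model the cumulative value keeps growing regardless of time, while the step value falls back to 0 after an idle step, which mirrors the behaviour summarised in 2.1 and 2.2.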

I really appreciate the clarification. I hope we can change the doc just a little bit; when I read it, I was confused about what the "step" is in Prometheus. If my comment above is correct, do you mind if I create a PR for the doc?

DanielYWoo avatar Oct 04 '23 01:10 DanielYWoo

Please do not guess based on the code but test it. :)

What you wrote above seems right. Just one more thing: there are registries that do not support histograms at all, and there are registries that support both cumulative and delta.

If you have suggestions for the docs, please feel free to file a PR.

jonatan-ivanov avatar Oct 04 '23 02:10 jonatan-ivanov

I have created a PR, please review: https://github.com/micrometer-metrics/micrometer-docs/pull/333 @jonatan-ivanov

DanielYWoo avatar Oct 05 '23 03:10 DanielYWoo

Closed by #333

jonatan-ivanov avatar Jan 25 '24 22:01 jonatan-ivanov