
Best way to do total count in case of counter reset?

Open somyarocketium opened this issue 5 years ago • 13 comments

I am using Prometheus + Grafana and wrote a simple query like sum(increase(some_count[$__range])). This takes the difference between the first and last values, but the Prometheus counter values reset whenever our Node server restarts. Is there a better approach to tackle this problem? [Image: sample graph showing the drop due to a server restart]

somyarocketium avatar Apr 09 '20 02:04 somyarocketium

Seems like a prometheus question, not prom-client. Have you googled around?

https://groups.google.com/forum/#!topic/prometheus-developers/JaSY73UQv70

https://www.robustperception.io/rate-then-sum-never-sum-then-rate

both seem a bit relevant.

Maybe use rate()? It's robust to counter resets.
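To illustrate what "robust to counter resets" means, here is a minimal Python sketch of the idea. It is not Prometheus's actual implementation (which also extrapolates to the edges of the range window); the function name and sample values are made up for illustration. The assumption is that any drop between consecutive samples is a process restart, after which the counter began again from zero.

```python
def increase_with_resets(samples):
    """Total increase over consecutive counter samples.

    Simplified model: a sample lower than its predecessor is treated as a
    counter reset, so the whole new value counts as growth since the restart.
    """
    total = 0.0
    for prev, curr in zip(samples, samples[1:]):
        if curr < prev:
            # Reset detected: the counter restarted from zero.
            total += curr
        else:
            total += curr - prev
    return total


# Hypothetical scrape values; the process restarts between 20 and 2.
samples = [10, 15, 20, 2, 7]

naive = samples[-1] - samples[0]          # last minus first ignores the reset
adjusted = increase_with_resets(samples)  # 5 + 5 + 2 + 5

print(naive, adjusted)
```

The naive last-minus-first difference goes negative here, while the reset-aware sum keeps counting across the restart.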

sam-github avatar Apr 13 '20 17:04 sam-github

Yes, I read a lot but couldn't find an exact solution. Using rate is similar to increase and will give the same graph.

somyarocketium avatar Apr 18 '20 01:04 somyarocketium

This comes up a lot, and unfortunately it's not something that can really be handled in this library. Using push-gateway is an option but requires some additional setup. (It'd probably be worthwhile to write up an example of this by now.)

rate/irate as sam-github suggested is also a fairly good solution if you can ignore the spikes.

zbjornson avatar May 10 '20 00:05 zbjornson

@brian-brazil do you have any input on how to handle this sort of issue (persisting metrics across resets, either when an app thread [cluster worker] crashes and respawns, or when the app itself restarts)? From what I understand, any improvement we could make in this library would run counter to the stateless design guidelines for Prometheus clients. This issue comes up a lot though, and push-gateway feels like a cumbersome solution.

See also #207, #385.

zbjornson avatar Aug 23 '20 20:08 zbjornson

If a process dies, it's normal for counters to reset; let rate worry about it. Don't try to persist counters across runs of an application.

I don't see the link to the pushgateway.

brian-brazil avatar Aug 23 '20 20:08 brian-brazil

Thanks!

pushgateway came up as a possible way to persist counters, e.g. by pushing metrics in an app shutdown hook. I totally understand the desire to persist counters -- if the metric doesn't lend itself to a rate (e.g. total page views), then the resets are problematic.

zbjornson avatar Aug 23 '20 20:08 zbjornson

> if the metric doesn't lend itself to a rate (e.g. total page views), then the resets are problematic.

That's a counter, so rate should work fine.

brian-brazil avatar Aug 23 '20 20:08 brian-brazil

Hm, I'm missing something then. Getting the total number of page views [since some date] from a rate requires taking the integral, right? Do Prometheus or Grafana have that capability? That also seems like it would require setting your Prometheus retention period to "forever".

zbjornson avatar Aug 23 '20 20:08 zbjornson

> Getting the total number of page views [since some date] from a rate requires taking the integral, right? Do Prometheus or Grafana have that capability?

Yes, that's what rate() does basically.
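As a rough illustration of the "integral" point, the Python sketch below computes a reset-aware per-second rate for each scrape step and then integrates it back over time. It assumes evenly spaced samples and a hypothetical 15-second scrape interval, and it ignores the window-edge extrapolation that Prometheus applies, so treat it as a toy model rather than what rate() literally does.

```python
def rates_per_second(samples, step_seconds):
    """Reset-aware per-second rate for each pair of consecutive samples."""
    rates = []
    for prev, curr in zip(samples, samples[1:]):
        # A drop is treated as a restart from zero (simplifying assumption).
        delta = curr if curr < prev else curr - prev
        rates.append(delta / step_seconds)
    return rates


def integrate(rates, step_seconds):
    """Integrate a piecewise-constant rate over equally sized steps."""
    return sum(rate * step_seconds for rate in rates)


step = 15                      # hypothetical scrape interval in seconds
samples = [10, 15, 20, 2, 7]   # hypothetical counter values with one reset

rates = rates_per_second(samples, step)
total = integrate(rates, step)  # recovers the reset-aware increase

print(rates, total)
```

Integrating the rate over the window gives back the total number of events in that window, which is why a rate-based query can still answer "how many page views in this range".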

> That also seems like it would require setting your Prometheus retention period to "forever".

That's a bit of a different use case, and out of scope for Prometheus. If you want a perfect count of how many times something has happened ever, logs are usually the appropriate solution. Prometheus works over arbitrary time periods of a specified duration, not unbounded time periods.

brian-brazil avatar Aug 23 '20 22:08 brian-brazil

Thanks @brian-brazil !

zbjornson avatar Aug 26 '20 23:08 zbjornson

@somyarocketium With Prometheus + Grafana, I believe you can get the grand total of all increases in the current window by using the legend. You can do something like this:

Query: [screenshot of the query]

Setting $__interval will automatically use the minimum x-axis time unit based on your window resolution; it is an auto-scaling feature you can use with Prometheus + Grafana. Make sure to also set min step to 1m.

Visualization > Legend: [screenshot of the legend settings]

The total reflected in the legend will be the total increase that happened in your window. This should work regardless of whether a counter reset happened.

Result: [screenshot of the resulting graph]

scyantist avatar Mar 31 '21 20:03 scyantist

Just to leave this here:

sum by (label) (increase(metric[$__range]))

Is what I was looking for when I came here via search.
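For anyone wondering what that query does conceptually, here is a small Python sketch, assuming each series is one process's counter and that a drop in value means that process restarted. The label values and sample numbers are invented for illustration. The increase is computed per series first and only then summed per label, which matches the "rate then sum" ordering from the article linked earlier in the thread.

```python
from collections import defaultdict


def increase_with_resets(samples):
    """Reset-aware increase for one series (simplified, no extrapolation)."""
    total = 0.0
    for prev, curr in zip(samples, samples[1:]):
        # A drop is treated as a restart from zero.
        total += curr if curr < prev else curr - prev
    return total


def sum_increase_by_label(series):
    """Compute the increase per series, then add the results up per label."""
    totals = defaultdict(float)
    for label, samples in series:
        totals[label] += increase_with_resets(samples)
    return dict(totals)


# Hypothetical data: two workers serving "home", one serving "about".
series = [
    ("home", [0, 4, 9]),      # no reset
    ("home", [3, 6, 1, 2]),   # this worker restarted between 6 and 1
    ("about", [7, 8]),
]

totals = sum_increase_by_label(series)
print(totals)
```

Summing the raw counters first and taking the increase afterwards would hide the individual restart, which is why the per-series step comes first.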

tcurdt avatar Apr 22 '22 13:04 tcurdt