smokeping_prober

Calculate stddev metric

Open elcomtik opened this issue 5 years ago • 9 comments

I'm missing a way to calculate stddev over histograms in Prometheus. However, even if they implement it in the future, it will have some error compared to a stddev computed over the "raw" data, which the exporter has.
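
To illustrate where that error comes from, here is a minimal sketch that compares an exact stddev with one rebuilt from bucket counts. The bucket bounds, the sample RTTs, and the midpoint rule are all assumptions made for illustration; none of them come from the exporter or from Prometheus.

```python
import math
import statistics

# Hypothetical bucket upper bounds in seconds (not the exporter's real defaults).
BOUNDS = [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0]


def bucket_counts(samples, bounds):
    """Per-bucket (non-cumulative) counts; the last slot is the overflow bucket."""
    counts = [0] * (len(bounds) + 1)
    for s in samples:
        for i, b in enumerate(bounds):
            if s <= b:
                counts[i] += 1
                break
        else:
            counts[-1] += 1
    return counts


def stddev_from_buckets(counts, bounds):
    """Approximate population stddev by placing every sample at its bucket midpoint."""
    lowers = [0.0] + bounds[:-1]
    mids = [(lo + hi) / 2 for lo, hi in zip(lowers, bounds)]
    mids.append(bounds[-1])  # overflow bucket has no upper bound, so reuse the last bound
    n = sum(counts)
    mean = sum(c * m for c, m in zip(counts, mids)) / n
    var = sum(c * (m - mean) ** 2 for c, m in zip(counts, mids)) / n
    return math.sqrt(var)


# Made-up round-trip times in seconds.
rtts = [0.011, 0.012, 0.013, 0.014, 0.030, 0.031, 0.045, 0.120]
exact = statistics.pstdev(rtts)
approx = stddev_from_buckets(bucket_counts(rtts, BOUNDS), BOUNDS)
print(f"exact={exact:.4f} approx={approx:.4f}")
```

The two values differ because the histogram only records which bucket a sample fell into, not where inside the bucket it was.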

It would be great if the exporter calculated it in a similar way to how it calculates histograms and exported it as a gauge metric. It wouldn't cost much processing time or storage on the Prometheus side either.
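
A minimal sketch of how a running stddev could be kept without storing the raw samples, using Welford's online algorithm. This is not the exporter's code; the class name `RunningStats` and its methods are hypothetical.

```python
import math


class RunningStats:
    """Welford's online algorithm: mean and variance without storing samples."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the running mean

    def observe(self, value):
        """Fold one new sample into the running mean and variance."""
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    def stddev(self):
        """Population standard deviation of everything observed so far."""
        if self.count == 0:
            return 0.0
        return math.sqrt(self._m2 / self.count)


stats = RunningStats()
for rtt in [0.020, 0.022, 0.021, 0.025]:
    stats.observe(rtt)
print(stats.mean, stats.stddev())
```

Each observation costs a few arithmetic operations and the state is three numbers. Whether such a gauge should cover the whole process lifetime or be reset per scrape is a design choice this sketch leaves open.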

elcomtik avatar Mar 22 '20 14:03 elcomtik

It is most likely that prometheus will not implement calculation of stddev over histograms according to issue https://github.com/prometheus/prometheus/issues/7030

elcomtik avatar Mar 22 '20 17:03 elcomtik

it would be nice to have this directly in the exporter. the reason here is that we can do a longer ping interval (say, try pinging every 15 seconds or, better yet, every time prometheus asks us) but then ping for (say) 5 seconds and calculate the min/max/avg/stddev/loss of that.

that is how someone replaced Smokeping with Grafana here:

https://hveem.no/visualizing-latency-variance-with-grafana

unfortunately they used InfluxDB instead of a prometheus exporter, but this could work similarly.

anarcat avatar Jun 04 '20 03:06 anarcat

@anarcat That looks like it could be reproduced with several histogram_quantile() queries. There's no use of stddev there.

SuperQ avatar Jun 04 '20 04:06 SuperQ

> @anarcat That looks like it could be reproduced with several histogram_quantile() queries. There's no use of stddev there.

The problem with that is that to get a large enough sample size, you need to combine multiple samples, which means you average over multiple scrape_interval periods. So, for example, if you want at least 4 samples, you will have an average over a minute with a 15s scrape_interval.

In contrast, smokeping does a bunch of quick pings (by default 20, with 500ms wait between them) and calculates one metric based on that.

This is what we (I think) mean here: instead of asking Prometheus to average multiple metrics over time, we want one metric to have min/max/avg/stddev. This is how smokeping draws those pretty graphs.
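
As a rough sketch of that "one metric per burst" model: the function below reduces a single burst of pings to min/max/avg/stddev/loss. The function name and the convention that `None` marks a lost packet are assumptions for illustration, not smokeping's implementation.

```python
import math


def summarize_burst(rtts):
    """Summarize one burst of pings; None marks a lost packet."""
    received = [r for r in rtts if r is not None]
    loss = 1 - len(received) / len(rtts) if rtts else 0.0
    if not received:
        return {"min": None, "max": None, "avg": None, "stddev": None, "loss": loss}
    avg = sum(received) / len(received)
    var = sum((r - avg) ** 2 for r in received) / len(received)
    return {
        "min": min(received),
        "max": max(received),
        "avg": avg,
        "stddev": math.sqrt(var),
        "loss": loss,
    }


# Made-up burst of five pings (seconds), one of them lost.
burst = [0.020, 0.022, None, 0.021, 0.025]
print(summarize_burst(burst))
```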

And sure, you could do the same thing over multiple samples in Prometheus itself, but you would definitely not get the resolution you get in Smokeping.

Otherwise you might as well just use the blackbox exporter... I don't quite see how this differs from the icmp probe there, actually...

anarcat avatar Jun 04 '20 14:06 anarcat

Yes, I know about smokeping's bursts of pings. IMO, smokeping's data model is flawed that way. This is where I intentionally deviated from smokeping's exact way of doing things. This prober sends a smooth, regular series of packets so that it measures at regular, controlled intervals.

Instead of sending 20 packets over 10 seconds every minute, you send one packet per second and scrape every 15 seconds. This has the same overall effect, but the measurement is, IMO, more accurate, as it's a continuous stream. There's no 50-second gap with no metrics about the ICMP stream.
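
The difference between the two schedules can be sketched by listing the send times and measuring the longest interval between packets. The function names and parameters here are hypothetical; the numbers are the ones mentioned in this thread (20 packets, 500ms apart, once a minute, versus one packet per second).

```python
def burst_schedule(period=60.0, packets=20, spacing=0.5, cycles=2):
    """Send times for smokeping-style bursts: `packets` pings at the start of each `period`."""
    return [c * period + i * spacing for c in range(cycles) for i in range(packets)]


def steady_schedule(interval=1.0, duration=120.0):
    """Send times for one ping every `interval` seconds."""
    return [i * interval for i in range(int(duration / interval))]


def longest_gap(times):
    """Longest interval between two consecutive sends."""
    return max(b - a for a, b in zip(times, times[1:]))


print("burst:", longest_gap(burst_schedule()))
print("steady:", longest_gap(steady_schedule()))
```

With these parameters the burst schedule leaves most of each minute unobserved, while the steady schedule never goes more than one interval without a probe.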

Also, you don't get back one metric for those 20 packets, you get several: Min, Max, Avg, StdDev. With the histogram data, you can calculate much more than just those.

For example, IMO, avg and max are not all that useful for continuous stream monitoring. What I really want to know is the 90th percentile or 99th percentile.

This smokeping prober is not intended to be a one-to-one replacement for smokeping's real implementation, but simply to provide similar functionality, using the power of Prometheus and PromQL to make it better.

SuperQ avatar Jun 04 '20 14:06 SuperQ

> For example, IMO, avg and max are not all that useful for continuous stream monitoring. What I really want to know is the 90th percentile or 99th percentile.

True, I guess...

How would you graph those statistics?

anarcat avatar Jun 04 '20 15:06 anarcat

If you just want a basic line, you can use the histogram_quantile() function.

histogram_quantile(0.9, rate(smokeping_response_duration_seconds_bucket[$__interval]))
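
For intuition about what such a query estimates, here is a simplified sketch of quantile estimation from cumulative buckets, using linear interpolation inside the bucket that contains the target rank. It is an illustration, not Prometheus's source code; the function name and the bucket values are made up.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate a quantile from cumulative buckets [(upper_bound, cumulative_count), ...].

    Buckets must be sorted by upper bound and end with (math.inf, total).
    The lowest bucket is assumed to start at 0.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for upper, count in buckets:
        if count >= rank and count > prev_count:
            if math.isinf(upper):
                # Rank falls in the overflow bucket: report the highest finite bound.
                return prev_bound
            # Assume samples are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (upper - prev_bound) * fraction
        prev_bound, prev_count = upper, count
    return prev_bound


# Made-up cumulative bucket counts for 100 pings.
buckets = [(0.01, 10), (0.02, 60), (0.05, 95), (math.inf, 100)]
print(estimate_quantile(0.9, buckets))
print(estimate_quantile(0.99, buckets))
```

The estimate can only be as fine as the bucket layout: a rank that lands in the overflow bucket is reported as the highest finite bound.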

But one of the reasons I prefer the histogram datatype is that you can use the heatmap panel type in Grafana, which is superior to the individual min/max/avg/stddev metrics that come from smokeping.

Say you had two routes, one slow and one fast, and some pings are sent over one and some over the other. Rather than showing a wide min/max and a correspondingly wide stddev, the heatmap would show a "line" for each route.
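
A small sketch of that two-route case, with made-up RTTs and hypothetical bucket bounds: the summary statistics collapse everything into one wide spread, while the bucket counts keep the two clusters apart.

```python
import statistics
from collections import Counter

# Hypothetical RTTs in seconds: a fast route near 10 ms and a slow one near 50 ms.
fast = [0.009, 0.010, 0.011, 0.010, 0.009]
slow = [0.049, 0.050, 0.051, 0.050, 0.052]
rtts = fast + slow

BOUNDS = [0.005, 0.015, 0.03, 0.045, 0.06, 0.1]  # assumed bucket upper bounds


def bucket_of(value):
    """Return the upper bound of the first bucket that contains value."""
    return next((b for b in BOUNDS if value <= b), float("inf"))


# What min/max/stddev report: one wide range.
summary = (min(rtts), max(rtts), statistics.pstdev(rtts))

# What a heatmap row is built from: counts per bucket.
heat = Counter(bucket_of(r) for r in rtts)
occupied = sorted(b for b, n in heat.items() if n > 0)

print("min/max/stddev:", summary)
print("occupied buckets:", occupied)
```

Only two buckets are occupied, with empty buckets between them, which is what shows up as two separate lines in a heatmap.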

SuperQ avatar Jun 04 '20 15:06 SuperQ

@SuperQ thanks for all the advice! I did end up using a heatmap in my own (blackbox-exporter-based) dashboard, and I guess that is "good enough" for my case for now. But I see the point of having buckets now...

See a longer discussion (quoting you here, I hope that's alright) here:

https://anarc.at/blog/2020-06-04-replacing-smokeping-prometheus/

anarcat avatar Jun 04 '20 16:06 anarcat

Hah! I don't mind the debate at all. A lot of this is hard to visualize without good examples, and even harder to discuss with just words. This project is something I want to spend more time on, but right now it's 3rd tier in my hobby projects.

SuperQ avatar Jun 04 '20 16:06 SuperQ