operations Use p95 instead of average for response time metrics

Use p95 instead of average for response time metrics

Open pnorman opened this issue 3 years ago • 4 comments

We are fairly consistently monitoring median response times, e.g. tile rendering, tile retrieval latency, and web site response time.

A better metric to monitor is p95 or p99, as this can better capture the user experience of slow loads.
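To make the difference concrete, here is a minimal Python sketch with made-up numbers (not real measurements from any of our services). It assumes 90 fast responses and 10 slow ones, and uses a small hypothetical helper, `nearest_rank_percentile`, rather than any particular monitoring library.

```python
import math
import statistics


def nearest_rank_percentile(samples, pct):
    """Return the pct-th percentile (integer 1..100) using the nearest-rank method."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # 1-based rank of the smallest value that covers pct percent of the samples.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical data: 90 fast responses and 10 slow ones, in seconds.
response_times = [0.1] * 90 + [2.0] * 10

mean = statistics.mean(response_times)
median = statistics.median(response_times)
p95 = nearest_rank_percentile(response_times, 95)

print(f"mean={mean:.2f}s median={median:.2f}s p95={p95:.2f}s")
# prints "mean=0.29s median=0.10s p95=2.00s"
```

With this data one request in ten takes 2 seconds, yet the median stays at 0.1 s and the mean only rises to 0.29 s. Only the p95 reports the slow loads directly.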

I'm trying to figure out how to do this for the tile CDN.

Jul 26 '22 21:07 pnorman

Do we have the right metrics for that? Trying to understand the histogram and *ile metrics always makes my head hurt...

Jul 26 '22 21:07 tomhughes

I'm not sure. I was able to get it from logs for the tile CDN. Some stuff we probably only have the average for, so will need to open upstream issues.

Jul 26 '22 22:07 pnorman

As far as I know, the only places we have dashboards for response time are the main web site and nominatim.

The main web site one is just based on total time and request count, so it can only give an average; the metric doesn't have any sort of breakdown.

The nominatim one does have a histogram metric, and we display the histogram data as well as some overall averages.

Computing p95 (or any other quantile) can be done either with a summary metric or from a histogram metric using the histogram_quantile function, according to https://prometheus.io/docs/practices/histograms/.

Jul 26 '22 23:07 tomhughes

As an example, this should give p95 for nominatim by request type:

histogram_quantile(0.95, sum by (type, le) (rate(nominatim_request_duration_seconds_bucket[$__rate_interval])))
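For anyone whose head also hurts from the histogram metrics, here is a rough Python sketch of the idea behind histogram_quantile for classic cumulative buckets: find the bucket that contains the target rank, then interpolate linearly inside it. This is a simplified illustration, not the Prometheus implementation. The function name `estimate_quantile` and the bucket bounds and counts are made up for the example, and it assumes the lowest bucket starts at 0.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate a quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count) pairs sorted by
    upper_bound, with the last upper_bound being +inf (like the `le` label).
    """
    if not 0 < q < 1:
        raise ValueError("q must be between 0 and 1 (exclusive)")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0  # assumption: lowest bucket starts at 0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # The quantile falls in the +Inf bucket, so the best available
                # answer is the highest finite bound.
                return lower_bound
            # Linear interpolation within the bucket that contains the rank.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Hypothetical cumulative counts: 50 requests <= 0.1s, 90 <= 0.5s, 98 <= 1s, 100 total.
example_buckets = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
p95_estimate = estimate_quantile(0.95, example_buckets)
print(p95_estimate)  # prints 0.8125
```

The estimate can only be as precise as the bucket boundaries allow. Here the true p95 could be anywhere between 0.5 s and 1 s, so bucket layout matters for how useful the resulting graph is.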

Jul 26 '22 23:07 tomhughes
