operations
Use p95 instead of average for response time metrics
We are fairly consistently monitoring median response times, e.g. tile rendering, tile retrieval latency, and web site response time.
A better metric to monitor is p95 or p99, as this can better capture the user experience of slow loads.
I'm trying to figure out how to do this for the tile CDN
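To illustrate why a tail percentile says more about slow loads than the median or the average, here is a minimal sketch using made-up numbers. The latency values are hypothetical and not taken from any of our services.

```python
import statistics

# Hypothetical latencies in milliseconds: 90 fast requests and 10 slow ones.
latencies_ms = [50] * 90 + [2000] * 10

median = statistics.median(latencies_ms)
mean = statistics.fmean(latencies_ms)
# quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
p95 = statistics.quantiles(latencies_ms, n=100, method="inclusive")[94]

print(f"median={median} ms, mean={mean} ms, p95={p95} ms")
```

With this sample the median stays at the fast value even though one request in ten takes two seconds. The mean moves but does not say how slow the slow requests are, while p95 lands on the slow value.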
Do we have the right metrics for that? Trying to understand the histogram and *ile metrics always makes my head hurt...
I'm not sure. I was able to get it from logs for the tile CDN. Some stuff we probably only have the average for, so will need to open upstream issues.
As far as I know the only places we have dashboards for response time are the main web site and nominatim.
The main web site one is just based on total time and count, and the metric doesn't have any sort of breakdown.
The nominatim one does have a histogram metric, and we display the histogram data as well as some overall averages.
Computing p95 (or any other quantile) can be done either with a summary metric or from a histogram metric using the histogram_quantile function, according to https://prometheus.io/docs/practices/histograms/.
As an example this should do p95 for nominatim by request type:
histogram_quantile(0.95, sum by (type, le) (rate(nominatim_request_duration_seconds_bucket[$__rate_interval])))
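For anyone whose head hurts from the histogram metrics, the estimate behind a query like the one above can be sketched in a few lines of Python. This is a simplified illustration of the documented approach (find the bucket containing the target rank, then interpolate linearly inside it), not Prometheus's actual code. The function name, the bucket bounds and the counts are all hypothetical, and it assumes the lowest bucket starts at 0.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets is a list of (upper_bound, cumulative_count) pairs sorted by
    upper_bound, ending with the math.inf ("+Inf") bucket, which is the same
    shape as the `le`-labelled *_bucket series.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total  # how many observations lie at or below the quantile
    prev_bound, prev_count = 0.0, 0.0  # assumption: lowest bucket starts at 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile is beyond the last finite bucket, so the best
                # available answer is that bucket's upper bound.
                return prev_bound
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical request durations in seconds, as cumulative counts.
example = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
p95_estimate = estimate_quantile(0.95, example)  # interpolated, roughly 0.81
p99_estimate = estimate_quantile(0.99, example)  # capped at 1.0
```

The result is only as precise as the bucket boundaries: anything beyond the last finite bucket is reported as that bucket's upper bound, so the buckets need to cover the latencies we care about.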