Allow `synapse_http_server_response_time_seconds` Grafana histogram quantiles to show values bigger than 10s
> If you use the default histogram buckets, or guess poorly (likely) the first time around, you will probably see a straight line at 10 (or your highest bucket boundary) when you calculate quantiles on this histogram. This is due to the fact that your quantiles or most of your observations are found in the +Inf bucket. No linear approximation saves you here, the only “sane” thing to do is return the lower known bound.
>
> -- https://linuxczar.net/blog/2017/06/15/prometheus-histogram-2/

> By default, these buckets are: .005, .01, .025, .05, .075, .1, .25, .5, .75, 1, 2.5, 5, 7.5, 10. This is very much tuned to measuring request durations below 10 seconds, so if you’re measuring something else you may need to configure custom buckets.
>
> -- https://tomgregory.com/the-four-types-of-prometheus-metrics/

> you have to pre-choose your buckets, and the costs moves from the client to Prometheus itself due to bucket cardinality. The default ten buckets cover a typical web service with latency in the millisecond to second range, and on occasion you will want to adjust them.
>
> -- https://www.robustperception.io/how-does-a-prometheus-histogram-work/
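To make the clamping behaviour concrete, here is a rough Python sketch of the linear interpolation that `histogram_quantile()` performs (illustration only, not Prometheus's actual implementation): once the requested quantile falls into the `+Inf` bucket, the best it can return is the highest finite bucket boundary, which is where the flat line at 10s comes from.

```python
import math

def histogram_quantile(q, buckets):
    """Rough sketch of Prometheus's histogram_quantile() interpolation.

    `buckets` is a list of (upper_bound, cumulative_count) pairs, sorted by
    upper_bound and ending with (math.inf, total_count) -- the same shape as
    the *_bucket series. Simplified for illustration.
    """
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for upper_bound, cum_count in buckets:
        if cum_count >= rank:
            if math.isinf(upper_bound):
                # The quantile falls in the +Inf bucket: all that can be
                # returned is the lower known bound, i.e. the highest finite
                # bucket boundary (10s with the default buckets).
                return prev_bound
            # Linear interpolation inside the bucket that contains the rank.
            frac = (rank - prev_count) / (cum_count - prev_count)
            return prev_bound + (upper_bound - prev_bound) * frac
        prev_bound, prev_count = upper_bound, cum_count

# With the default buckets, a p99 that really sits around 30s still reports 10s:
buckets = [(5.0, 500), (7.5, 600), (10.0, 700), (math.inf, 1000)]
print(histogram_quantile(0.99, buckets))  # -> 10.0, the false ceiling
```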
Part of https://github.com/matrix-org/synapse/issues/13356
Before
The purple >99% percentile line has a false max ceiling of 10s because the bucket boundaries don't go above 10.
https://grafana.matrix.org/d/dYoRgTgVz/messages-timing?orgId=1&var-datasource=default&var-bucket_size=%24__auto_interval_bucket_size&var-instance=matrix.org&var-job=synapse_client_reader&var-index=All&from=1660039325520&to=1660060925520&viewPanel=152

After
I don't know if this actually fixes it (haven't tested).
Dev notes
- https://github.com/kiali/kiali/issues/1021
- https://www.robustperception.io/how-does-a-prometheus-histogram-work/
Docs:
- https://prometheus.io/docs/practices/histograms/
- https://prometheus.io/docs/tutorials/understanding_metric_types/#histogram
- https://prometheus.io/docs/concepts/metric_types/#histogram
https://github.com/prometheus/client_python/blob/5a5261dd45d65914b5e3d8225b94d6e0578882f3/prometheus_client/metrics.py#L544
`DEFAULT_BUCKETS = (.005, .01, .025, .05, .075, .1, .25, .5, .75, 1.0, 2.5, 5.0, 7.5, 10.0, INF)`
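For comparison, this is roughly how those defaults get overridden with `prometheus_client` -- a minimal sketch; the label set and the exact boundary values here are illustrative, not necessarily what this PR ends up shipping:

```python
from prometheus_client import Histogram

# Illustrative only: bucket boundaries extended past the default 10s ceiling so
# that high quantiles on slow endpoints are no longer clamped to 10.
response_timer = Histogram(
    "synapse_http_server_response_time_seconds",
    "Time taken to send a response to an HTTP request",
    ["method", "servlet", "tag", "code"],
    buckets=(0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0,
             2.5, 5.0, 10.0, 20.0, 30.0, 60.0, 120.0, float("inf")),
)
```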
The histogram exposes three sets of series: `synapse_http_server_response_time_seconds_bucket`, `synapse_http_server_response_time_seconds_sum`, `synapse_http_server_response_time_seconds_count`.
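As a quick sanity check, a throwaway histogram shows where those three series come from (stand-alone example, not Synapse code):

```python
from prometheus_client import CollectorRegistry, Histogram, generate_latest

# Separate registry just for this demo, so nothing global is touched.
registry = CollectorRegistry()
hist = Histogram(
    "synapse_http_server_response_time_seconds",
    "Example histogram",
    registry=registry,
    buckets=(0.5, 1.0, 5.0, 10.0),
)
hist.observe(0.3)
hist.observe(12.0)  # lands in the +Inf bucket

# The exposition text includes the _bucket series (one per boundary, plus
# +Inf), along with _sum and _count.
print(generate_latest(registry).decode())
```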
Pull Request Checklist
- [x] Pull request is based on the develop branch
- [x] Pull request includes a changelog file. The entry should:
  - Be a short description of your change which makes sense to users. "Fixed a bug that prevented receiving messages from other servers." instead of "Moved X method from `EventStore` to `EventWorkerStore`.".
  - Use markdown where necessary, mostly for `code blocks`.
  - End with either a period (.) or an exclamation mark (!).
  - Start with a capital letter.
  - Feel free to credit yourself, by adding a sentence "Contributed by @github_username." or "Contributed by [Your Name]." to the end of the entry.
- [ ] ~~Pull request includes a sign off~~
- [x] Code style is correct (run the linters)
I'm not sure if we really want this. There have been complaints in the past that the synapse_http_server_response_time collection of metrics is too big and causes performance issues with Prometheus:
- https://github.com/matrix-org/synapse/issues/11082
Adding ~30% more buckets seems like a step in the wrong direction for #11082. Is there a particular insight we're hoping to gain by raising the 10 second cap?
Hang on two secs, I'm a bit concerned about removing the .75 buckets in terms of losing definition in the common cases.
Sorry for not jumping in on this sooner, but: the vast majority of our APIs are responding within the range of 0-10s, so losing fidelity there reduces our insight into response times. There is quite a big difference between APIs that return in 500ms and those that return in 1s, and removing the 750ms bucket means we can't easily differentiate between them.
Since this is a thing that we're adding specifically to measure progress in performance improvements for a particular API, I'm very tempted to suggest that we simply create a separate metric for the /messages API. This would also allow us to differentiate between local and remote /messages requests, for example.
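For what it's worth, a separate metric along those lines could look something like this -- purely a hypothetical sketch, with made-up metric and label names:

```python
from prometheus_client import Histogram

# Hypothetical: a dedicated histogram for the /messages endpoint, with buckets
# that extend well past 10s and a label to separate local requests from ones
# that need remote backfill. Names are illustrative only.
messages_response_timer = Histogram(
    "synapse_http_messages_response_time_seconds",
    "Time taken to respond to /messages requests",
    ["origin"],  # e.g. "local" vs "remote"
    buckets=(0.25, 0.5, 1.0, 2.5, 5.0, 10.0, 20.0, 30.0, 60.0, 120.0, float("inf")),
)

# Usage sketch inside the request handler:
# with messages_response_timer.labels(origin="local").time():
#     ...handle the request...
```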