Allow `synapse_http_server_response_time_seconds` Grafana histogram quantiles to show values bigger than 10s
> If you use the default histogram buckets, or guess poorly (likely) the first time around, you will probably see a straight line at 10 (or your highest bucket boundary) when you calculate quantiles on this histogram. This is due to the fact that your quantiles or most of your observations are found in the +Inf bucket. No linear approximation saves you here, the only “sane” thing to do is return the lower known bound.
>
> -- https://linuxczar.net/blog/2017/06/15/prometheus-histogram-2/

> By default, these buckets are: .005, .01, .025, .05, .075, .1, .25, .5, .75, 1, 2.5, 5, 7.5, 10. This is very much tuned to measuring request durations below 10 seconds, so if you’re measuring something else you may need to configure custom buckets.
>
> -- https://tomgregory.com/the-four-types-of-prometheus-metrics/

> you have to pre-choose your buckets, and the costs moves from the client to Prometheus itself due to bucket cardinality. The default ten buckets cover a typical web service with latency in the millisecond to second range, and on occasion you will want to adjust them.
>
> -- https://www.robustperception.io/how-does-a-prometheus-histogram-work/
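To make the clamping behaviour concrete, here is a rough Python sketch of the linear interpolation that `histogram_quantile()` performs (illustration only, not Prometheus's actual implementation): once the requested quantile falls into the `+Inf` bucket, the best it can return is the highest finite bucket boundary, which is where the flat line at 10s comes from.

```python
import math

def histogram_quantile(q, buckets):
    """Rough sketch of Prometheus's histogram_quantile() interpolation.

    `buckets` is a list of (upper_bound, cumulative_count) pairs, sorted by
    upper_bound and ending with (math.inf, total_count) -- the same shape as
    the *_bucket series. Simplified for illustration.
    """
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for upper_bound, cum_count in buckets:
        if cum_count >= rank:
            if math.isinf(upper_bound):
                # The quantile falls in the +Inf bucket: all that can be
                # returned is the lower known bound, i.e. the highest finite
                # bucket boundary (10s with the default buckets).
                return prev_bound
            # Linear interpolation inside the bucket that contains the rank.
            frac = (rank - prev_count) / (cum_count - prev_count)
            return prev_bound + (upper_bound - prev_bound) * frac
        prev_bound, prev_count = upper_bound, cum_count

# With the default buckets, a p99 that really sits around 30s still reports 10s:
buckets = [(5.0, 500), (7.5, 600), (10.0, 700), (math.inf, 1000)]
print(histogram_quantile(0.99, buckets))  # -> 10.0, the false ceiling
```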
Part of https://github.com/matrix-org/synapse/issues/13356
Before
The purple >99% percentile line has a false max ceiling of 10s because the bucket boundaries don't go above 10.
https://grafana.matrix.org/d/dYoRgTgVz/messages-timing?orgId=1&var-datasource=default&var-bucket_size=%24__auto_interval_bucket_size&var-instance=matrix.org&var-job=synapse_client_reader&var-index=All&from=1660039325520&to=1660060925520&viewPanel=152

After
I don't know if this actually fixes it (haven't tested).
Dev notes
- https://github.com/kiali/kiali/issues/1021
- https://www.robustperception.io/how-does-a-prometheus-histogram-work/
Docs:
- https://prometheus.io/docs/practices/histograms/
- https://prometheus.io/docs/tutorials/understanding_metric_types/#histogram
- https://prometheus.io/docs/concepts/metric_types/#histogram
https://github.com/prometheus/client_python/blob/5a5261dd45d65914b5e3d8225b94d6e0578882f3/prometheus_client/metrics.py#L544
`DEFAULT_BUCKETS = (.005, .01, .025, .05, .075, .1, .25, .5, .75, 1.0, 2.5, 5.0, 7.5, 10.0, INF)`
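For comparison, this is roughly how those defaults get overridden with `prometheus_client` -- a minimal sketch; the label set and the exact boundary values here are illustrative, not necessarily what this PR ends up shipping:

```python
from prometheus_client import Histogram

# Illustrative only: bucket boundaries extended past the default 10s ceiling so
# that high quantiles on slow endpoints are no longer clamped to 10.
response_timer = Histogram(
    "synapse_http_server_response_time_seconds",
    "Time taken to send a response to an HTTP request",
    ["method", "servlet", "tag", "code"],
    buckets=(0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0,
             2.5, 5.0, 10.0, 20.0, 30.0, 60.0, 120.0, float("inf")),
)
```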
The histogram exposes three sets of series: `synapse_http_server_response_time_seconds_bucket`, `synapse_http_server_response_time_seconds_sum`, `synapse_http_server_response_time_seconds_count`.
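As a quick sanity check, a throwaway histogram shows where those three series come from (stand-alone example, not Synapse code):

```python
from prometheus_client import CollectorRegistry, Histogram, generate_latest

# Separate registry just for this demo, so nothing global is touched.
registry = CollectorRegistry()
hist = Histogram(
    "synapse_http_server_response_time_seconds",
    "Example histogram",
    registry=registry,
    buckets=(0.5, 1.0, 5.0, 10.0),
)
hist.observe(0.3)
hist.observe(12.0)  # lands in the +Inf bucket

# The exposition text includes the _bucket series (one per boundary, plus
# +Inf), along with _sum and _count.
print(generate_latest(registry).decode())
```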
Pull Request Checklist
- [x] Pull request is based on the develop branch
- [x] Pull request includes a changelog file. The entry should:
  - Be a short description of your change which makes sense to users. "Fixed a bug that prevented receiving messages from other servers." instead of "Moved X method from `EventStore` to `EventWorkerStore`.".
  - Use markdown where necessary, mostly for `code blocks`.
  - End with either a period (.) or an exclamation mark (!).
  - Start with a capital letter.
  - Feel free to credit yourself, by adding a sentence "Contributed by @github_username." or "Contributed by [Your Name]." to the end of the entry.
- [ ] ~~Pull request includes a sign off~~
- [x] Code style is correct (run the linters)
I'm not sure if we really want this. There have been complaints in the past that the synapse_http_server_response_time collection of metrics is too big and causes performance issues with Prometheus:
- https://github.com/matrix-org/synapse/issues/11082
Adding ~30% more buckets seems like a step in the wrong direction for #11082. Is there a particular insight we're hoping to gain by raising the 10 second cap?
Hang on two secs, I'm a bit concerned about removing the .75 buckets in terms of losing definition in the common cases.
Sorry for not jumping in on this sooner, but: the vast majority of our APIs are responding within the range of 0-10s, so losing fidelity there reduces our insight into response times. There is quite a big difference between APIs that return in 500ms and those that return in 1s, and removing the 750ms bucket means we can't easily differentiate between them.
Since this is a thing that we're adding specifically to measure progress in performance improvements for a particular API, I'm very tempted to suggest that we simply create a separate metric for the /messages API. This would also allow us to differentiate between local and remote /messages requests, for example.
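For what it's worth, a separate metric along those lines could look something like this -- purely a hypothetical sketch, with made-up metric and label names:

```python
from prometheus_client import Histogram

# Hypothetical: a dedicated histogram for the /messages endpoint, with buckets
# that extend well past 10s and a label to separate local requests from ones
# that need remote backfill. Names are illustrative only.
messages_response_timer = Histogram(
    "synapse_http_messages_response_time_seconds",
    "Time taken to respond to /messages requests",
    ["origin"],  # e.g. "local" vs "remote"
    buckets=(0.25, 0.5, 1.0, 2.5, 5.0, 10.0, 20.0, 30.0, 60.0, 120.0, float("inf")),
)

# Usage sketch inside the request handler:
# with messages_response_timer.labels(origin="local").time():
#     ...handle the request...
```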