ratelimit
Response time histograms using the prometheus sink are no longer in seconds
We migrated from statsD to the prometheus sink and use the following mapper snippet to monitor our infrastructure:
- match: "ratelimit_server.*.response_time"
  name: "ratelimit_service_response_time_seconds"
  timer_type: histogram
  labels:
    grpc_method: "$1"
These metrics used to be reported in seconds, but are now reported in milliseconds.
As stated in the statsd-exporter README:
Statsd timer data is transmitted in milliseconds, while Prometheus expects the unit to be seconds. The exporter converts all timer observations to seconds. Histogram and distribution events (h and d metric type) are not subject to unit conversion.
In statsd_exporter, this conversion used to happen when parsing observer events: https://github.com/prometheus/statsd_exporter/blob/c18857b71b4afc2c304e4d34aa431a41234843ac/pkg/line/line.go#L82. In the new implementation, the histogram value is taken as-is: https://github.com/envoyproxy/ratelimit/blob/28b1629a21e885bdd2b527d6a1c1de8483dc47d4/src/stats/prom/prometheus_sink.go#L157.
This change (regression?) means that the default histogram buckets, which assume values in seconds, no longer make sense for values in milliseconds. I think we need to implement the same kind of unit switch.
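To make the proposal concrete, here is a rough sketch of what such a unit switch could look like. It is not the actual code from prometheus_sink.go: the function name `observationValue` and the way the StatsD type is passed in as a string are assumptions made for illustration only. It follows the rule quoted from the README: timer (`ms`) observations are divided by 1000, while `h` and `d` values are passed through unchanged.

```go
package main

import "fmt"

// observationValue is a hypothetical helper (not part of ratelimit or
// statsd_exporter). It converts a raw StatsD observation to the value that
// should be recorded in the Prometheus histogram.
//
// Assumption: statType is the StatsD metric type suffix, where "ms" is a
// timer and "h"/"d" are histogram/distribution events.
func observationValue(statType string, raw float64) float64 {
	if statType == "ms" {
		// Timers are transmitted in milliseconds; Prometheus expects seconds.
		return raw / 1000.0
	}
	// Histogram and distribution events are not subject to unit conversion.
	return raw
}

func main() {
	fmt.Println(observationValue("ms", 250)) // timer: 250 ms becomes 0.25 s
	fmt.Println(observationValue("h", 250))  // histogram: left as 250
}
```

With a switch like this in the sink, a 250 ms response would be observed as 0.25, which falls inside buckets defined in seconds.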
WDYT?