Response time histograms using the prometheus sink are no longer in seconds

Open harpunius opened this issue 4 months ago • 0 comments

We migrated from StatsD to the Prometheus sink and use the following mapper snippet to monitor our infrastructure:

    - match: "ratelimit_server.*.response_time"
      name: "ratelimit_service_response_time_seconds"
      timer_type: histogram
      labels:
        grpc_method: "$1"

These metrics used to be output in seconds, but are now output in milliseconds.

As stated in the statsd-exporter README:

> Statsd timer data is transmitted in milliseconds, while Prometheus expects the unit to be seconds. The exporter converts all timer observations to seconds. Histogram and distribution events (h and d metric type) are not subject to unit conversion.

In statsd_exporter, this conversion happened when timer lines were parsed into observer events: https://github.com/prometheus/statsd_exporter/blob/c18857b71b4afc2c304e4d34aa431a41234843ac/pkg/line/line.go#L82. In the new Prometheus sink implementation, the histogram value is taken as-is: https://github.com/envoyproxy/ratelimit/blob/28b1629a21e885bdd2b527d6a1c1de8483dc47d4/src/stats/prom/prometheus_sink.go#L157.
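To make the difference concrete, here is a rough Python paraphrase of the unit switch described in the README quote. It is not the actual Go code from either repository, and `to_observation` is a hypothetical name used only for illustration.

```python
def to_observation(value: float, stat_type: str) -> float:
    """Return the value a Prometheus histogram would observe.

    Sketch of the README rule: timers ("ms") are converted from
    milliseconds to seconds, while histogram ("h") and distribution
    ("d") events are passed through unchanged.
    """
    if stat_type == "ms":
        return value / 1000.0  # milliseconds -> seconds
    if stat_type in ("h", "d"):
        return value  # no unit conversion
    raise ValueError(f"unsupported stat type: {stat_type!r}")


# A 250 ms response time sent as a timer is observed as 0.25 seconds,
# but the same number taken as-is is observed as 250.
print(to_observation(250, "ms"))
print(to_observation(250, "h"))
```

If the sink treats every timer like the second call, the metric keeps its `_seconds` name but carries millisecond values.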

This change (regression?) means that the default histogram buckets, which are sized for values in seconds, no longer make sense for values in milliseconds. I think we need to implement the same kind of unit switch in the Prometheus sink.
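A small sketch of why the buckets stop being useful, assuming the defaults in effect are the Prometheus Go client's `DefBuckets` (0.005 up to 10). I have not checked which buckets the sink actually registers, and `bucket_for` is a hypothetical helper.

```python
from bisect import bisect_left

# Assumption: the default buckets match the Prometheus Go client's DefBuckets.
DEFAULT_BUCKETS = (0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10)


def bucket_for(value: float) -> float:
    """Return the upper bound (le) of the smallest bucket containing value.

    Values above the largest bound only fall into the implicit +Inf bucket.
    """
    i = bisect_left(DEFAULT_BUCKETS, value)
    return DEFAULT_BUCKETS[i] if i < len(DEFAULT_BUCKETS) else float("inf")


response_time_ms = 25
print(bucket_for(response_time_ms / 1000))  # observed in seconds
print(bucket_for(response_time_ms))         # observed in milliseconds, as-is
```

Under that assumption, any response slower than 10 ms lands only in the `+Inf` bucket when the value is recorded in milliseconds, so quantile estimates from the histogram lose nearly all resolution.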

WDYT?

harpunius • Oct 09 '24 09:10