Support for redis_latency_percentiles_usec Metric in DragonFlyDB
Issue Summary:
When using the oliver006/redis_exporter with DragonFlyDB, we're not seeing the redis_latency_percentiles_usec metric that is typically available when the exporter is connected to Redis. This becomes a problem for monitoring setups — especially Grafana dashboards — that rely on this metric to track latency percentiles.
Environment:
- DragonFlyDB version: [latest Docker image]
- Redis Exporter version: oliver006/redis_exporter:v1.51.0
- Prometheus version: v2.44.0
- Grafana version: 9.5.2
Steps to Reproduce:
- Run both Redis and DragonFlyDB in the same environment (e.g., via Docker Compose).
- Set up oliver006/redis_exporter for each service with the same config.
- Scrape metrics using Prometheus.
- Check for the redis_latency_percentiles_usec metric in Prometheus.
- It appears for Redis but not for DragonFly.
What We Expected:
We expected both Redis and DragonFlyDB to expose this metric so we could reuse the same dashboards and alerts across both.
What Actually Happens:
The metric is missing from DragonFlyDB, so panels that rely on latency percentiles (like P99, P95, P50) are empty or broken when connected to a DragonFly instance.
Why This Matters:
Latency percentiles like P99, P95, and P50 are key metrics for understanding system performance — especially under load. They help identify tail latency and are widely used in SLOs and operational dashboards. Not having this data makes it harder to monitor and compare performance between Redis and DragonFlyDB.
Other Notes:
We did confirm that other metrics like redis_commands_processed_total and redis_commands_duration_seconds_count are being exposed correctly by both Redis and DragonFlyDB via the exporter. So the issue seems specific to the latency percentile metric.
We’re aware that DragonFly provides its own Prometheus endpoint with histograms that can be used to calculate percentiles via histogram_quantile(), but having redis_latency_percentiles_usec directly available would be a big help for compatibility with existing tools and dashboards.
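As a rough illustration of that workaround, the sketch below shows how a percentile can be estimated from cumulative histogram buckets, in the spirit of PromQL's histogram_quantile(). The bucket bounds and counts are made-up sample values, not real DragonFly output, and the function simplifies edge cases that Prometheus handles.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile (0 <= q <= 1) from cumulative buckets.

    `buckets` is a list of (upper_bound, cumulative_count) pairs, like the
    `le` buckets of a Prometheus histogram. Linear interpolation is used
    inside the bucket that contains the target rank.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            in_bucket = count - prev_count
            if math.isinf(bound) or in_bucket == 0:
                # Target falls in the +Inf bucket (or an empty one):
                # fall back to the highest finite bound seen so far.
                return prev_bound
            fraction = (rank - prev_count) / in_bucket
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical latency buckets in microseconds (assumed sample data).
sample_buckets = [(100, 50), (200, 90), (400, 100), (math.inf, 100)]
p50 = histogram_quantile(0.50, sample_buckets)
p95 = histogram_quantile(0.95, sample_buckets)
p99 = histogram_quantile(0.99, sample_buckets)
```

The estimate is only as precise as the bucket layout, which is one reason the exact values from redis_latency_percentiles_usec would still be useful for dashboard parity.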
Would love to know if this metric could be supported in the future, or if there's a recommended workaround for this use case.
Hi @swasthikshetty10, thanks for creating the issue. I will look into this.
For the overview of algorithms that allow computing percentiles: https://docs.google.com/document/d/1CR2w1E799Ar5_3OyKowP3fsOgFAtUdsd-rJrVKArWbg/edit?tab=t.0
Valkey uses hdr_histogram (https://github.com/HdrHistogram/HdrHistogram_c)
We need to be able to merge statistics (since each thread manages its own). https://github.com/cafaro/UDDSketch talks about mergeability as its differentiating property.
I am still curious how it is done in Valkey, as it may have multiple I/O threads. In any case, based on my 10 minutes of research, UDDSketch may be the way to go.
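To illustrate what mergeability means here, below is a minimal sketch. It is neither UDDSketch nor HdrHistogram, just a hypothetical fixed power-of-two bucket histogram (the class name and bucket layout are assumptions) where each thread records into its own instance and merging is an element-wise sum of counts.

```python
import math


class LogHistogram:
    """Toy power-of-two latency histogram, for illustration only."""

    def __init__(self, num_buckets=32):
        self.counts = [0] * num_buckets

    def record(self, usec):
        # Bucket i covers [2**i, 2**(i+1)); anything below 2 lands in bucket 0.
        idx = max(int(usec), 1).bit_length() - 1
        self.counts[min(idx, len(self.counts) - 1)] += 1

    def merge(self, other):
        # Merging is exact: bucket boundaries are identical in every instance.
        for i, count in enumerate(other.counts):
            self.counts[i] += count

    def percentile(self, p):
        """Return the upper bound of the bucket holding the p-th percentile."""
        total = sum(self.counts)
        if total == 0:
            return None
        target = math.ceil(total * p / 100)
        seen = 0
        for i, count in enumerate(self.counts):
            seen += count
            if seen >= target:
                return 2 ** (i + 1)
        return 2 ** len(self.counts)


# Two "threads" record independently, then their results are merged.
thread_a, thread_b = LogHistogram(), LogHistogram()
for usec in (1, 2, 3, 100):
    thread_a.record(usec)
for usec in (1000, 5):
    thread_b.record(usec)

merged = LogHistogram()
merged.merge(thread_a)
merged.merge(thread_b)
```

Because the bucket layout is fixed, the merged result is identical to recording every sample into a single histogram, which is the property needed when each thread keeps its own statistics.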
We need a histogram per command, and each histogram contains multiple buckets, so it would be wasteful to also use thread-local data structures for that. Long story short, I will use HdrHistogram_c, which has atomicity support, and I will try to limit the updates to avoid contention on this histogram.
To clarify - in addition to the info section like this:
# Latencystats
latency_percentiles_usec_ping:p50=2.007,p99=2.007,p99.9=2.007
latency_percentiles_usec_set:p50=13.055,p99=13.055,p99.9=13.055
....
There is also a latency histogram [cmd] command (currently not implemented), which can be implemented as well.
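For anyone consuming the Latencystats section directly, here is a small sketch of parsing those lines into per-command percentiles. The function name is hypothetical, and the sample text is copied from the output shown above; it assumes every field has the form name=value.

```python
def parse_latencystats(info_text):
    """Parse latency_percentiles_usec_<cmd> lines into {cmd: {percentile: usec}}."""
    prefix = "latency_percentiles_usec_"
    result = {}
    for line in info_text.splitlines():
        line = line.strip()
        if not line.startswith(prefix):
            continue  # skips the "# Latencystats" header and unrelated lines
        name, _, fields = line.partition(":")
        command = name[len(prefix):]
        result[command] = {
            key: float(value)
            for key, value in (pair.split("=", 1) for pair in fields.split(","))
        }
    return result


sample = """# Latencystats
latency_percentiles_usec_ping:p50=2.007,p99=2.007,p99.9=2.007
latency_percentiles_usec_set:p50=13.055,p99=13.055,p99.9=13.055
"""
stats = parse_latencystats(sample)
```

Each entry in this structure corresponds to one command and one percentile in the info output, so `stats["set"]["p99"]` is the p99 latency of SET in microseconds.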