Support for redis_latency_percentiles_usec Metric in DragonFlyDB

Open swasthikshetty10 opened this issue 8 months ago • 3 comments

Title: Support for redis_latency_percentiles_usec Metric in DragonFlyDB

Issue Summary:

When using the oliver006/redis_exporter with DragonFlyDB, we're not seeing the redis_latency_percentiles_usec metric that is typically available when the exporter is connected to Redis. This becomes a problem for monitoring setups — especially Grafana dashboards — that rely on this metric to track latency percentiles.

Environment:

  • DragonFlyDB version: [latest Docker image]
  • Redis Exporter version: oliver006/redis_exporter:v1.51.0
  • Prometheus version: v2.44.0
  • Grafana version: 9.5.2

Steps to Reproduce:

  1. Run both Redis and DragonFlyDB in the same environment (e.g., via Docker Compose).
  2. Set up oliver006/redis_exporter for each service with the same config.
  3. Scrape metrics using Prometheus.
  4. Check for the redis_latency_percentiles_usec metric in Prometheus.
  5. It appears for Redis but not for DragonFly.
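
The check in steps 4 and 5 can also be done directly against each exporter's /metrics output. The following is a minimal Python sketch, assuming the standard Prometheus text exposition format. The two sample payloads and their label names are hypothetical stand-ins, not captured exporter output.

```python
def metric_names(exposition_text):
    """Return the set of metric names found in Prometheus text-format output."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # The metric name ends at the first '{' (labels) or ' ' (value).
        end = len(line)
        for sep in ("{", " "):
            idx = line.find(sep)
            if idx != -1:
                end = min(end, idx)
        names.add(line[:end])
    return names


def has_metric(exposition_text, name):
    """True if at least one sample of the named metric is present."""
    return name in metric_names(exposition_text)


# Hypothetical samples for illustration only (labels are an assumption).
REDIS_SAMPLE = """\
# TYPE redis_commands_processed_total counter
redis_commands_processed_total 1200
redis_latency_percentiles_usec{cmd="ping",quantile="50"} 2.007
"""

DRAGONFLY_SAMPLE = """\
# TYPE redis_commands_processed_total counter
redis_commands_processed_total 1150
"""

if __name__ == "__main__":
    for label, text in (("redis", REDIS_SAMPLE), ("dragonfly", DRAGONFLY_SAMPLE)):
        print(label, has_metric(text, "redis_latency_percentiles_usec"))
```

In a real setup the text would come from fetching each exporter's /metrics endpoint rather than from inline strings.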

What We Expected:

We expected both Redis and DragonFlyDB to expose this metric so we could reuse the same dashboards and alerts across both.

What Actually Happens:

The metric is missing from DragonFlyDB, so panels that rely on latency percentiles (like P99, P95, P50) are empty or broken when connected to a DragonFly instance.

Why This Matters:

Latency percentiles like P99, P95, and P50 are key metrics for understanding system performance — especially under load. They help identify tail latency and are widely used in SLOs and operational dashboards. Not having this data makes it harder to monitor and compare performance between Redis and DragonFlyDB.

Other Notes:

We did confirm that other metrics like redis_commands_processed_total and redis_commands_duration_seconds_count are being exposed correctly by both Redis and DragonFlyDB via the exporter. So the issue seems specific to the latency percentile metric.

We’re aware that DragonFly provides its own Prometheus endpoint with histograms that can be used to calculate percentiles via histogram_quantile(), but having redis_latency_percentiles_usec directly available would be a big help for compatibility with existing tools and dashboards.
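
For context on that workaround, here is a rough Python sketch of the kind of interpolation histogram_quantile() performs over cumulative buckets. It is a simplified approximation, not the Prometheus implementation, and the bucket bounds and counts are invented for illustration.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile (0 <= q <= 1) from cumulative histogram buckets.

    `buckets` is a list of (upper_bound, cumulative_count) pairs, with the
    last upper bound being +inf, similar to Prometheus `le` buckets.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank and count > prev_count:
            if math.isinf(bound):
                # The quantile falls in the overflow bucket: the best we can
                # report is the highest finite upper bound.
                return prev_bound
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Invented example: latency buckets in microseconds.
EXAMPLE_BUCKETS = [(100.0, 50), (200.0, 90), (400.0, 100), (math.inf, 100)]

if __name__ == "__main__":
    for q in (0.5, 0.95, 0.99):
        print(q, histogram_quantile(q, EXAMPLE_BUCKETS))
```

The estimate is only as precise as the bucket layout, which is one reason a precomputed percentile metric is more convenient for existing dashboards.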

Would love to know if this metric could be supported in the future, or if there's a recommended workaround for this use case.

swasthikshetty10 avatar May 09 '25 14:05 swasthikshetty10

Hi @swasthikshetty10, thanks for creating the issue. I will look into this.

BagritsevichStepan avatar May 16 '25 05:05 BagritsevichStepan

For an overview of algorithms that compute percentiles, see: https://docs.google.com/document/d/1CR2w1E799Ar5_3OyKowP3fsOgFAtUdsd-rJrVKArWbg/edit?tab=t.0

Valkey uses hdr_histogram (https://github.com/HdrHistogram/HdrHistogram_c)

romange avatar Jun 01 '25 09:06 romange

We need to be able to merge statistics (since each thread manages its own). https://github.com/cafaro/UDDSketch describes mergeability as its differentiating property.

I am still curious how it is done in Valkey, as it may have multiple I/O threads. In any case, based on my 10 minutes of research, UDDSketch may be the way to go.
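
To illustrate what mergeability means here, the following is a minimal Python sketch of a log-bucketed histogram whose per-thread instances can be combined by adding bucket counts. It is a simplified stand-in written for this note, not UDDSketch or HdrHistogram, and the class name and accuracy parameter are hypothetical.

```python
import math
from collections import Counter


class LogHistogram:
    """Log-scale bucket counts with a bounded relative error per bucket."""

    def __init__(self, relative_accuracy=0.01):
        self.gamma = (1 + relative_accuracy) / (1 - relative_accuracy)
        self.log_gamma = math.log(self.gamma)
        self.counts = Counter()
        self.total = 0

    def record(self, value_usec):
        if value_usec <= 0:
            raise ValueError("latency must be positive")
        # Bucket i covers the range (gamma**(i-1), gamma**i].
        index = math.ceil(math.log(value_usec) / self.log_gamma)
        self.counts[index] += 1
        self.total += 1

    def merge(self, other):
        """Fold another histogram (e.g. from another thread) into this one."""
        if other.gamma != self.gamma:
            raise ValueError("histograms must share the same accuracy")
        self.counts.update(other.counts)  # Counter.update adds counts
        self.total += other.total

    def percentile(self, p):
        if self.total == 0:
            return math.nan
        rank = max(1, math.ceil(p / 100 * self.total))
        seen = 0
        for index in sorted(self.counts):
            seen += self.counts[index]
            if seen >= rank:
                # Representative value for the bucket; this choice keeps the
                # relative error within the configured accuracy.
                return 2 * self.gamma ** index / (self.gamma + 1)
        return math.nan


if __name__ == "__main__":
    thread_a, thread_b = LogHistogram(), LogHistogram()
    for v in range(1, 501):
        thread_a.record(v)
    for v in range(501, 1001):
        thread_b.record(v)
    thread_a.merge(thread_b)
    print(thread_a.percentile(50), thread_a.percentile(99))
```

Because merging is just per-bucket addition, the merged result is identical to what a single histogram would hold had it seen every sample, which is the property the per-thread design relies on.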

romange avatar Jun 01 '25 09:06 romange

We need a histogram per command, and each histogram contains multiple buckets, so it would be wasteful to also use thread-local data structures for that. Long story short, I will use HdrHistogram_c, which has atomicity support, and I will try to limit the updates to avoid contention on this histogram.

romange avatar Jun 20 '25 17:06 romange

To clarify - in addition to the info section like this:

# Latencystats
latency_percentiles_usec_ping:p50=2.007,p99=2.007,p99.9=2.007
latency_percentiles_usec_set:p50=13.055,p99=13.055,p99.9=13.055
....

there is also a LATENCY HISTOGRAM [cmd] command, which is currently not implemented but can be implemented as well.
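
For anyone consuming the Latencystats section shown above, here is a small Python sketch of a parser for those lines. It assumes only the line format quoted in this comment; the function name is hypothetical.

```python
def parse_latencystats(info_text):
    """Parse `latency_percentiles_usec_<cmd>:p50=...,p99=...` lines.

    Returns {command: {percentile_label: microseconds}}.
    """
    prefix = "latency_percentiles_usec_"
    result = {}
    for line in info_text.splitlines():
        line = line.strip()
        # Ignore section headers, blank lines, and unrelated fields.
        if not line.startswith(prefix):
            continue
        name, _, fields = line.partition(":")
        command = name[len(prefix):]
        percentiles = {}
        for field in fields.split(","):
            key, _, value = field.partition("=")
            percentiles[key] = float(value)
        result[command] = percentiles
    return result


# Sample taken from the output quoted above.
SAMPLE = """\
# Latencystats
latency_percentiles_usec_ping:p50=2.007,p99=2.007,p99.9=2.007
latency_percentiles_usec_set:p50=13.055,p99=13.055,p99.9=13.055
"""

if __name__ == "__main__":
    for command, values in parse_latencystats(SAMPLE).items():
        print(command, values)
```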

romange avatar Jun 25 '25 13:06 romange