
Increase SQL Query Histogram Range Beyond 10ms for Latency Tracking

Open aryanmehrotra opened this issue 2 months ago • 2 comments

The current 10 ms upper bound on GoFr’s SQL query histogram buckets is great for ultra-fast query visibility (for apps serving millions of requests per minute), but it compresses all slower queries into the same top bucket, which makes real-world latency analysis difficult — especially in production workloads, where 100 ms–5 s queries are normal for analytical or transactional databases.


⚙️ Why 10 ms was used initially

  • It was designed to detect micro-latency regressions in high-throughput services (like internal GoFr benchmarks).
  • Early use cases focused on microservices hitting in-memory or well-indexed SQL queries where anything >10 ms was an anomaly.
  • It optimized for Prometheus cardinality — fewer histogram buckets = smaller metrics footprint.

Why that’s limiting in real workloads

  • Queries taking 20 ms, 200 ms, or 2 s all fall into the same bucket — you lose resolution.
  • Makes it impossible to differentiate between “slightly slow” and “critical” queries.
  • You can’t correlate query latency with API request latency effectively.
  • Makes alerting thresholds (like P95 or P99 query latency) misleadingly small.

Recommended Fix

Adopt a wider histogram range with logarithmic or percentile-style bucket spacing.

Example ideal bucket set for SQL histograms:

[]float64{
    0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 20, 30, 60,
}

This covers:

  • Millisecond-level detail for fast queries (down to the 1 ms bucket)
  • Smooth spread for mid-range queries
  • Full visibility up to 60 seconds for extreme cases

🧩 Dynamic Adjustment (Could be an option)

Instead of hardcoding buckets, expose them via configuration — for example:

app.Metrics.SetHistogramBuckets("sql_query_duration_seconds", []float64{
    0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 2, 5, 10, 30, 60,
})

You could tie this to:

  • Datasource type (MySQL, Postgres, BigQuery, etc.)
  • Request timeout (e.g., if REQUEST_TIMEOUT=30s, upper bound could be 30 s)
  • Environment (dev → short, prod → wide)

📊 Impact

  • Slightly higher Prometheus storage footprint (more buckets)

  • Far more actionable insights:

    • 95th percentile query time per DB
    • Breakdown of fast vs slow queries
    • Easier detection of query regressions after deployments

aryanmehrotra avatar Oct 27 '25 15:10 aryanmehrotra

Hi @aryanmehrotra, I’d love to fix this issue! I have some experience with Go and performance instrumentation, and I’d like to help implement the wider SQL query histogram range as described. Could you please assign this issue to me or guide me on how to proceed?

nitin-999-code avatar Nov 11 '25 16:11 nitin-999-code

  • This issue is being tracked in PR #2526

Umang01-hash avatar Nov 13 '25 08:11 Umang01-hash