Histogram instead of summary with quantile for latency metrics

Open AlexZzz opened this issue 1 year ago • 5 comments

  • Version: any
  • Platform: server and agent
  • Subsystem: any

There are a lot of metrics of type summary in SPIRE. This metric type records the request count, the sum of all latencies, and the latency distribution as quantiles. There are two problems with quantiles:

  • They're expensive to compute. The computation runs in the software (SPIRE, in this case), not in the database, so it costs some CPU on every request
  • They're impossible to aggregate. Averaging or summing precomputed quantiles across instances does not produce a meaningful quantile

Aggregation is the biggest problem in my case. It's impossible to build a useful dashboard from a large number of metrics represented as quantiles when there are dozens, hundreds or thousands of spire-agents.
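To make the aggregation problem concrete, here is a minimal sketch with made-up numbers. The two agents, their latencies, and the `quantile` helper are all assumptions for illustration, not SPIRE code. It uses the median to keep the arithmetic exact; the same effect applies to p95 or p99.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// quantile returns the q-quantile (0 < q <= 1) of samples using the
// nearest-rank method. samples must be non-empty. It sorts a copy so the
// caller's slice is left untouched.
func quantile(samples []float64, q float64) float64 {
	sorted := append([]float64(nil), samples...)
	sort.Float64s(sorted)
	rank := int(math.Ceil(q * float64(len(sorted))))
	if rank < 1 {
		rank = 1
	}
	return sorted[rank-1]
}

func main() {
	// Made-up latencies in milliseconds for two hypothetical agents.
	busyAgent := []float64{10, 10, 10, 10, 10, 10, 10, 10, 10} // 9 fast requests
	idleAgent := []float64{500}                                // 1 slow request

	// What a dashboard does when it averages per-agent quantiles.
	averaged := (quantile(busyAgent, 0.5) + quantile(idleAgent, 0.5)) / 2

	// What is actually wanted: the quantile over all requests.
	all := append(append([]float64(nil), busyAgent...), idleAgent...)
	actual := quantile(all, 0.5)

	fmt.Printf("average of per-agent medians: %.0f ms\n", averaged)
	fmt.Printf("median over all requests:     %.0f ms\n", actual)
}
```

With these inputs the averaged value is 255 ms while the median over all ten requests is 10 ms, because the average ignores how many requests each agent served.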

A much better option is to use histograms. One can calculate quantiles from histograms in the database. The result won't be as accurate as quantiles computed in the service, but it is accurate enough for most uses.
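As a rough sketch of why histograms aggregate cleanly: bucket counts from different agents can simply be added, and a quantile can then be estimated from the merged buckets by linear interpolation inside the bucket that contains the target rank. This is similar in spirit to what Prometheus does with `histogram_quantile`, but the `histogram` type, the bucket bounds, and the sample values below are assumptions made up for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// histogram stores per-bucket counts. counts[i] holds observations v with
// bounds[i-1] < v <= bounds[i]; the extra last slot is the overflow bucket.
// bounds must be ascending and contain at least one value.
type histogram struct {
	bounds []float64
	counts []uint64
}

func newHistogram(bounds []float64) *histogram {
	return &histogram{bounds: bounds, counts: make([]uint64, len(bounds)+1)}
}

// observe increments the first bucket whose upper bound is >= v.
func (h *histogram) observe(v float64) {
	h.counts[sort.SearchFloat64s(h.bounds, v)]++
}

// merge adds bucket counts. It assumes both histograms share the same bounds.
func merge(a, b *histogram) *histogram {
	out := newHistogram(a.bounds)
	for i := range out.counts {
		out.counts[i] = a.counts[i] + b.counts[i]
	}
	return out
}

// quantile estimates the q-quantile by interpolating linearly within the
// bucket that contains the target rank. It returns 0 for an empty histogram
// and the highest finite bound if the rank falls in the overflow bucket.
func (h *histogram) quantile(q float64) float64 {
	var total uint64
	for _, c := range h.counts {
		total += c
	}
	if total == 0 {
		return 0
	}
	target := q * float64(total)
	var seen float64
	for i, c := range h.counts {
		if c == 0 || seen+float64(c) < target {
			seen += float64(c)
			continue
		}
		if i == len(h.bounds) {
			return h.bounds[len(h.bounds)-1]
		}
		lower := 0.0
		if i > 0 {
			lower = h.bounds[i-1]
		}
		return lower + (h.bounds[i]-lower)*(target-seen)/float64(c)
	}
	return h.bounds[len(h.bounds)-1]
}

func main() {
	bounds := []float64{5, 10, 25, 50, 100} // bucket upper bounds in ms (made up)
	agentA, agentB := newHistogram(bounds), newHistogram(bounds)
	for _, ms := range []float64{3, 7, 8, 20} {
		agentA.observe(ms)
	}
	for _, ms := range []float64{30, 40, 60, 200} {
		agentB.observe(ms)
	}

	fleet := merge(agentA, agentB)
	fmt.Println("merged bucket counts:", fleet.counts)
	fmt.Printf("estimated fleet median: %.1f ms\n", fleet.quantile(0.5))
}
```

For these eight samples the estimate is 25 ms, while the exact nearest-rank median is 20 ms. The error is bounded by the width of the bucket the quantile falls into, so the choice of bucket bounds determines the accuracy.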

It would be nice to have latency histograms in spire:)

AlexZzz avatar Jul 17 '24 10:07 AlexZzz

SPIRE supports multiple telemetry backends: https://github.com/spiffe/spire/blob/v1.10.0/doc/telemetry_config.md

I think there may have been some work to massage telemetry data into a histogram-like structure inside the M3 sink code ... perhaps it is possible to do the same thing for the Prometheus sink? I don't think statsd, dogstatsd, etc. support it though?

evan2645 avatar Jul 18 '24 19:07 evan2645

Unfortunately I work only with prometheus, not statsd/m3db/etc. Can't help with them 😞

Here there's something about "bins" in statsd. They're probably the same as prometheus histograms. I couldn't find anything similar for dogstatsd.

Do you think it's possible to make the histogram code global, rather than backend-dependent as it currently is for m3?

AlexZzz avatar Jul 23 '24 14:07 AlexZzz

I think we're open to emitting some of these metrics as histograms. We'll need someone to figure out the best way to support this generically with our telemetry package (go-metrics) and supported backends.

azdagron avatar Aug 15 '24 18:08 azdagron

I dug into this a little bit. The Prometheus sink uses hashicorp/go-metrics, which uses Summaries by default for AddSample and AddSampleWithLabels. If we want to support histograms, we ultimately need to call Histogram.Observe instead of Summary.Observe.

I'm not familiar with all the backends, but the dogstatsd sink seems to be limited in a similar way, in that Datadog would support histograms if the metrics were submitted that way in the first place.

By way of comparison, the m3 sink depends on uber-go/tally instead, and adds additional methods to produce a histogram from within AddSample and AddSampleWithLabels.
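The general idea of producing a histogram from within AddSample can be sketched with the standard library alone. Everything below is hypothetical: `histogramSink`, its bucket bounds, and the metric key are made up for illustration and are not the m3 sink, go-metrics, or SPIRE code. Only the method shape `AddSample(key []string, val float32)` is meant to mirror the go-metrics sink interface.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"sync"
)

// histogramSink is a hypothetical sink that buckets every sample instead of
// feeding it to a summary. It is not part of go-metrics or SPIRE.
type histogramSink struct {
	mu      sync.Mutex
	bounds  []float64           // ascending bucket upper bounds
	buckets map[string][]uint64 // metric name -> per-bucket counts (+ overflow)
}

func newHistogramSink(bounds []float64) *histogramSink {
	return &histogramSink{bounds: bounds, buckets: make(map[string][]uint64)}
}

// AddSample has the same shape as the go-metrics sink method of that name,
// but records the value into a fixed bucket.
func (s *histogramSink) AddSample(key []string, val float32) {
	name := strings.Join(key, "_")
	s.mu.Lock()
	defer s.mu.Unlock()
	counts, ok := s.buckets[name]
	if !ok {
		counts = make([]uint64, len(s.bounds)+1)
		s.buckets[name] = counts
	}
	// First bucket whose upper bound is >= val; len(bounds) means overflow.
	counts[sort.SearchFloat64s(s.bounds, float64(val))]++
}

// snapshot returns a copy of the bucket counts for one metric name,
// or nil if nothing has been recorded under that name.
func (s *histogramSink) snapshot(name string) []uint64 {
	s.mu.Lock()
	defer s.mu.Unlock()
	counts, ok := s.buckets[name]
	if !ok {
		return nil
	}
	return append([]uint64(nil), counts...)
}

func main() {
	sink := newHistogramSink([]float64{10, 100, 1000}) // upper bounds in ms (made up)
	key := []string{"spire", "rpc", "latency"}         // illustrative metric key
	for _, ms := range []float32{7, 42, 5000} {
		sink.AddSample(key, ms)
	}
	fmt.Println(sink.snapshot("spire_rpc_latency"))
}
```

A real sink would hand the observation to the backend's histogram type rather than keep its own counters; the sketch only shows where the summary-versus-histogram decision is made.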

So as it stands, I can think of a few options:

  1. Add histogram data to SPIRE by modifying one backend at a time, and dropping the dependency on go-metrics for that backend (just like the m3 sink). We'd want to add a configuration option to avoid breaking changes.
  2. Modify go-metrics to add histogram support.
  3. Replace go-metrics entirely with something new (e.g. OpenTelemetry) that is more broadly supported and more flexible. This seems like a breaking change.

Any other ideas?

heymarcel avatar Aug 27 '24 18:08 heymarcel

This issue is stale because it has been open for 365 days with no activity.

github-actions[bot] avatar Nov 19 '25 22:11 github-actions[bot]

This issue was closed because it has been inactive for 30 days since being marked as stale.

github-actions[bot] avatar Dec 19 '25 22:12 github-actions[bot]