
local metrics aggregation for percentiles

Open Geal opened this issue 3 years ago • 0 comments

sozu implements in-memory storage for metrics, so that data is quickly available when exploring production issues.

Displaying this data has always been difficult because some of the metrics are latency percentiles. Since those are calculated on each worker, they cannot be aggregated across workers, whereas counters can be aggregated (by addition) and so can gauges (average, max, etc.).
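To make the problem concrete, here is a small sketch with made-up numbers (the two workers and their latencies are hypothetical, not taken from sozu): one worker serves 1000 requests at 10 ms, another serves 10 requests at 500 ms. Counters from the two workers simply add up, but combining their p99 values does not give the p99 of the whole traffic.

```rust
/// Nearest-rank percentile over raw samples: the smallest value with at
/// least `p` percent of the samples at or below it.
fn nearest_rank(samples: &[u64], p: u64) -> u64 {
    assert!(!samples.is_empty() && p <= 100);
    let mut sorted = samples.to_vec();
    sorted.sort_unstable();
    // Integer ceiling of p * n / 100, clamped to at least 1.
    let rank = ((p * sorted.len() as u64 + 99) / 100).max(1) as usize;
    sorted[rank - 1]
}

/// Hypothetical latencies in ms: a busy worker with fast requests.
fn busy_worker() -> Vec<u64> {
    vec![10; 1000]
}

/// Hypothetical latencies in ms: a mostly idle worker with slow requests.
fn idle_worker() -> Vec<u64> {
    vec![500; 10]
}

/// What we get by averaging the per-worker p99 values.
fn averaged_p99() -> u64 {
    (nearest_rank(&busy_worker(), 99) + nearest_rank(&idle_worker(), 99)) / 2
}

/// The p99 computed over every request from both workers.
fn true_p99() -> u64 {
    let mut all = busy_worker();
    all.extend(idle_worker());
    nearest_rank(&all, 99)
}
```

With these numbers, `averaged_p99()` yields 255 ms and taking the max would yield 500 ms, while `true_p99()` is 10 ms, because 1000 of the 1010 requests completed in 10 ms.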

The data structure we use to calculate percentiles is an HdrHistogram. As it turns out, that structure can be easily serialized, and histograms can be added to each other (as long as they cover the same range of values, something we could guarantee).
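The two properties mentioned here can be illustrated with a toy histogram. This is only a sketch: `ToyHistogram` is a hypothetical fixed-width-bucket structure written for this example, not HdrHistogram itself (which uses a different bucket layout and its own serialization format), and the byte encoding below is an arbitrary choice.

```rust
/// Toy fixed-width-bucket histogram, a stand-in for illustration only.
#[derive(Debug, Clone, PartialEq)]
pub struct ToyHistogram {
    bucket_width: u64,
    counts: Vec<u64>,
}

impl ToyHistogram {
    /// Covers values in [0, bucket_width * buckets).
    pub fn new(bucket_width: u64, buckets: usize) -> Self {
        assert!(bucket_width > 0 && buckets > 0);
        ToyHistogram { bucket_width, counts: vec![0; buckets] }
    }

    /// Record one value; values past the range are clamped into the last bucket.
    pub fn record(&mut self, value: u64) {
        let last = self.counts.len() as u64 - 1;
        let idx = (value / self.bucket_width).min(last) as usize;
        self.counts[idx] += 1;
    }

    /// Add another histogram's counts; only valid when both share the same layout.
    pub fn add(&mut self, other: &ToyHistogram) -> Result<(), String> {
        if self.bucket_width != other.bucket_width || self.counts.len() != other.counts.len() {
            return Err("histograms cover different ranges".to_string());
        }
        for (mine, theirs) in self.counts.iter_mut().zip(&other.counts) {
            *mine += *theirs;
        }
        Ok(())
    }

    /// Upper bound of the bucket holding the p-th percentile (nearest rank).
    pub fn value_at_percentile(&self, p: u64) -> Option<u64> {
        let total: u64 = self.counts.iter().sum();
        if total == 0 {
            return None;
        }
        let rank = ((p * total + 99) / 100).max(1);
        let mut seen = 0u64;
        for (idx, count) in self.counts.iter().enumerate() {
            seen += *count;
            if seen >= rank {
                return Some((idx as u64 + 1) * self.bucket_width);
            }
        }
        None
    }

    /// Serialize as little-endian u64 words: bucket width, then each count.
    pub fn to_bytes(&self) -> Vec<u8> {
        let mut out = Vec::with_capacity(8 * (self.counts.len() + 1));
        out.extend_from_slice(&self.bucket_width.to_le_bytes());
        for count in &self.counts {
            out.extend_from_slice(&count.to_le_bytes());
        }
        out
    }

    /// Inverse of `to_bytes`; returns None on malformed input.
    pub fn from_bytes(bytes: &[u8]) -> Option<Self> {
        if bytes.len() < 16 || bytes.len() % 8 != 0 {
            return None;
        }
        let mut words = bytes.chunks_exact(8).map(|chunk| {
            let mut buf = [0u8; 8];
            buf.copy_from_slice(chunk);
            u64::from_le_bytes(buf)
        });
        let bucket_width = words.next()?;
        if bucket_width == 0 {
            return None;
        }
        Some(ToyHistogram { bucket_width, counts: words.collect() })
    }
}
```

A worker would call `to_bytes()` on its histogram, and the receiver would call `from_bytes()` and then `add()` before asking for a percentile. The layout check in `add()` corresponds to the "same range of values" condition above.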

The plan:

  • Instead of calculating percentiles for each metric (and per backend, per app, per time range, or globally) on the worker and returning them to the main process, we would return the serialized histograms (probably base64 encoded).
  • When aggregating the data from the workers, add the histograms to each other, then calculate the percentiles.
  • That aggregation could be done in the main process, or even in command clients: the histograms could be transmitted further to aggregate data from multiple servers.
  • We could store only the most precise data (for a 1 second range, per backend, per app, per worker) and aggregate it (count, gauge or histogram) as needed, depending on the query: whether we're asking for global data, for the last minute, for all backends of an app, etc.
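As a rough sketch of the last point of the plan, the store below keeps one histogram per second, app, backend and worker, and merges whatever matches the query before computing a percentile. All names (`SeriesKey`, `Query`, `LatencyStore`, `percentile_ms`) are hypothetical and invented for this example, and the histogram is reduced to a plain vector of 1 ms buckets rather than an HdrHistogram.

```rust
use std::collections::HashMap;

/// Finest granularity kept in the store.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct SeriesKey {
    pub second: u64,
    pub app: String,
    pub backend: String,
    pub worker: u32,
}

/// Filter for a query; `None` means "any".
#[derive(Debug, Default)]
pub struct Query {
    pub from_second: Option<u64>,
    pub app: Option<String>,
    pub backend: Option<String>,
}

/// Number of 1 ms buckets; every series shares this layout so they can be added.
const BUCKETS: usize = 1000;

#[derive(Default)]
pub struct LatencyStore {
    series: HashMap<SeriesKey, Vec<u64>>,
}

impl LatencyStore {
    /// Record one latency; values past the range go into the last bucket.
    pub fn record(&mut self, key: SeriesKey, latency_ms: u64) {
        let buckets = self.series.entry(key).or_insert_with(|| vec![0; BUCKETS]);
        let idx = latency_ms.min(BUCKETS as u64 - 1) as usize;
        buckets[idx] += 1;
    }

    /// Merge every series matching the query by element-wise addition.
    pub fn aggregate(&self, query: &Query) -> Vec<u64> {
        let mut merged = vec![0u64; BUCKETS];
        for (key, buckets) in &self.series {
            let matches = query.from_second.map_or(true, |s| key.second >= s)
                && query.app.as_ref().map_or(true, |a| *a == key.app)
                && query.backend.as_ref().map_or(true, |b| *b == key.backend);
            if matches {
                for (total, count) in merged.iter_mut().zip(buckets) {
                    *total += *count;
                }
            }
        }
        merged
    }
}

/// Nearest-rank percentile, in ms, over merged bucket counts.
pub fn percentile_ms(buckets: &[u64], p: u64) -> Option<u64> {
    let total: u64 = buckets.iter().sum();
    if total == 0 {
        return None;
    }
    let rank = ((p * total + 99) / 100).max(1);
    let mut seen = 0u64;
    for (ms, count) in buckets.iter().enumerate() {
        seen += *count;
        if seen >= rank {
            return Some(ms as u64);
        }
    }
    None
}
```

A global query is `Query::default()`; restricting to one app or to recent seconds only changes which series get merged, while the percentile is always computed last, on the merged counts.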

Geal, Jan 02 '22 22:01