
Add APPROX_HISTOGRAM_K Operation

Open · jbrooks-stripe opened this issue 10 months ago · 1 comment

Summary

Adds an APPROX_HISTOGRAM_K operation based on the FrequentItems Sketch: https://datasketches.apache.org/docs/Frequency/FrequentItemsOverview.html

This is mostly a wrapper on top of the frequent items sketch, but we have made the following two modifications to make it more user friendly:

  • The sketch requires a k that is a power of two. Instead of making this a user requirement, we round k up to the next power of two internally and truncate the results down to k entries.
  • The underlying map used by the sketch has a load factor of 0.75, which causes the approximation to kick in before there are k elements. To avoid this, we keep an exact histogram (a plain map) and switch over to the sketch only once the number of entries exceeds k.
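
The two modifications above can be illustrated with a minimal Python sketch. It is not the code in this change: the class names `ApproxHistogramK` and `MisraGries` are hypothetical, and a simple Misra-Gries counter stands in for the DataSketches FrequentItems sketch, so the error behavior will differ from the real library. It only shows the shape of the idea: round k up to a power of two for the sketch, stay exact until there are more than k entries, and truncate results to k.

```python
from collections import Counter


def next_power_of_two(k: int) -> int:
    """Smallest power of two that is >= k."""
    if k < 1:
        raise ValueError("k must be positive")
    return 1 << (k - 1).bit_length()


class MisraGries:
    """Stand-in for the FrequentItems sketch (assumption, not the real library).

    Keeps at most `capacity` counters; counts may be underestimated.
    """

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.counts = {}

    def update(self, item, weight: int = 1) -> None:
        self.counts[item] = self.counts.get(item, 0) + weight
        if len(self.counts) > self.capacity:
            # Subtract the smallest count from every entry and drop the
            # entries that reach zero, which frees at least one slot.
            smallest = min(self.counts.values())
            self.counts = {
                key: count - smallest
                for key, count in self.counts.items()
                if count > smallest
            }


class ApproxHistogramK:
    """Exact histogram until more than k distinct entries, then approximate."""

    def __init__(self, k: int):
        self.k = k
        self.exact = Counter()
        self.sketch = None  # created lazily when the exact map overflows

    def update(self, item) -> None:
        if self.sketch is not None:
            self.sketch.update(item)
            return
        self.exact[item] += 1
        if len(self.exact) > self.k:
            # Switch over: seed the sketch with the exact counts so far.
            self.sketch = MisraGries(next_power_of_two(self.k))
            for key, count in self.exact.items():
                self.sketch.update(key, count)
            self.exact = None

    def result(self) -> dict:
        counts = self.exact if self.sketch is None else self.sketch.counts
        # Truncate down to the k most frequent entries.
        top = sorted(counts.items(), key=lambda kv: (-kv[1], str(kv[0])))
        return dict(top[: self.k])
```

With this sketch, a stream with at most k distinct values returns exact counts, while a stream with more than k distinct values returns at most k entries whose counts may be lower than the true counts.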

Why / Goal

The current histogram operation uses unbounded memory and isn't stable for production use cases.

Test Plan

We've tested everything end to end through backfills and the CLI on the batch side on our end.

  • [x] Added Unit Tests
  • [x] Covered by existing CI
  • [x] Integration tested

Checklist

  • [x] Documentation update

Reviewers

jbrooks-stripe · Mar 29 '24 15:03