chronon
Add APPROX_HISTOGRAM_K Operation
Summary
Adds an APPROX_HISTOGRAM_K operation based on the FrequentItems Sketch: https://datasketches.apache.org/docs/Frequency/FrequentItemsOverview.html
This is mostly a wrapper around the frequent items sketch, with two modifications to make it more user friendly:
- The sketch requires k to be a power of two. Instead of making this a user requirement, we round k up to the next power of two internally and truncate the results down to k entries.
- The underlying map used by the sketch has a load factor of 0.75, which causes the approximation to kick in before there are k elements. We changed this to keep an exact histogram (map) and switch over to the sketch only once the number of entries exceeds k.
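
The two modifications above can be illustrated with a minimal Python sketch. It is not the code in this PR: `next_power_of_two`, `MisraGries`, and `HybridHistogram` are hypothetical names, and a simple Misra-Gries summary stands in for the DataSketches FrequentItems sketch purely to show the control flow (exact map first, hand-off to a sketch on overflow, truncation to k in the result).

```python
def next_power_of_two(k: int) -> int:
    """Smallest power of two >= k (hypothetical helper)."""
    if k < 1:
        raise ValueError("k must be positive")
    return 1 << (k - 1).bit_length()


class MisraGries:
    """Stand-in for the FrequentItems sketch; NOT the DataSketches algorithm."""

    def __init__(self, capacity, initial=None):
        self.capacity = capacity
        self.counts = dict(initial or {})

    def update(self, item):
        if item in self.counts or len(self.counts) < self.capacity:
            self.counts[item] = self.counts.get(item, 0) + 1
        else:
            # No room for a new key: decrement every counter, drop zeros.
            for key in list(self.counts):
                self.counts[key] -= 1
                if self.counts[key] == 0:
                    del self.counts[key]


class HybridHistogram:
    """Exact counts for up to k distinct keys, approximate counts beyond that."""

    def __init__(self, k):
        self.k = k
        self.exact = {}      # used while there are at most k distinct keys
        self.sketch = None   # created lazily on the (k+1)-th distinct key

    @property
    def is_exact(self):
        return self.sketch is None

    def update(self, item):
        if self.sketch is None:
            if item in self.exact or len(self.exact) < self.k:
                self.exact[item] = self.exact.get(item, 0) + 1
                return
            # Entries would exceed k: seed the sketch with the exact counts.
            # The sketch size is k rounded up to a power of two.
            self.sketch = MisraGries(next_power_of_two(self.k), self.exact)
            self.exact = {}
        self.sketch.update(item)

    def result(self):
        counts = self.exact if self.sketch is None else self.sketch.counts
        # Truncate down to the k largest counts requested by the user.
        top = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
        return dict(top[: self.k])


hist = HybridHistogram(k=3)
for item in ["a"] * 5 + ["b"] * 3 + ["c"]:
    hist.update(item)
exact_result = hist.result()   # still exact: only 3 distinct keys so far
hist.update("d")               # 4th distinct key triggers the switch
```

With k = 3 the stand-in sketch is sized to 4, so the user never has to pick a power of two, and the result is still capped at 3 entries.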
Why / Goal
The current histogram operation uses unbounded memory, which makes it unstable for production use cases.
Test Plan
We've tested everything end to end on the batch side through backfills and the CLI.
- [x] Added Unit Tests
- [x] Covered by existing CI
- [x] Integration tested
Checklist
- [x] Documentation update