starrocks icon indicating copy to clipboard operation
starrocks copied to clipboard

HLL function with higher precision

Open chen9t opened this issue 2 years ago • 3 comments

  • [x] https://github.com/StarRocks/starrocks/pull/20836

Feature request

Is your feature request related to a problem? Please describe.

The HyperLogLog (HLL) algorithm is a probabilistic algorithm used to estimate the cardinality (number of distinct elements) of a set with a high degree of accuracy, while using minimal memory.

StarRocks also has structure and functions related to HLL. But the precision of HLL algorithm is limited to 2^14, because the bias of current HLL algorithm is hard coded in hll.cpp.:

double bias = 5.9119 * 1.0e-18 * (estimate * estimate * estimate * estimate) -
              1.4253 * 1.0e-12 * (estimate * estimate * estimate) + 1.2940 * 1.0e-7 * (estimate * estimate) -
              5.2921 * 1.0e-3 * estimate + 83.3216;

So the current algorithm only works well in precision 2^14.

Describe the solution you'd like

Keep the current HLL structure and functions, add one function uniq_combined() to support higher precision, and the precision can be changed with parameter.

Describe alternatives you've considered

Introduce HLL function uniq_combined() which powered by Apache Datasketches. This project is dedicated to providing a broad selection of sketch algorithms of production quality which can produce results orders-of magnitude faster and with mathematically proven error bounds.

Maybe more functions can be developed based on this toolkit, which allows simplification of system's architecture and reduction in overall compute resources required for these heretofore difficult computational tasks.

chen9t avatar Apr 02 '23 03:04 chen9t

Will split PR #20832 into 2 PRs:

  1. Introduce datasketches toolkit;
  2. New HLL function based on datasketches.

chen9t avatar Apr 02 '23 06:04 chen9t

@kangkaisen Needn't the PR #20836 be merged before closing this issue?

jaogoy avatar Jul 04 '23 08:07 jaogoy

We have marked this issue as stale because it has been inactive for 6 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to StarRocks!

github-actions[bot] avatar Jan 08 '24 11:01 github-actions[bot]