iceberg
Prototype HLL buffers in manifest files to provide column distinct estimates.
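Distinct counts aren't very valuable for cost-based optimization because they can't be easily merged: two files can contain the same values, so their counts can't simply be added together. They should be removed. As a replacement, look into storing HLL (HyperLogLog) buffers, if they aren't too large.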
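To illustrate the merge problem, here is a minimal sketch in Python. It is a toy dense HyperLogLog written for this thread, not the airlift format and not anything Iceberg implements. The class name `HllSketch`, the `index_bits` parameter, and the two example "files" are all hypothetical.

```python
import hashlib
import math


class HllSketch:
    """Toy dense HyperLogLog (hypothetical; not the airlift format)."""

    def __init__(self, index_bits=11):
        self.index_bits = index_bits
        # One byte per register, 2**index_bits registers.
        self.registers = bytearray(1 << index_bits)

    def add(self, value):
        # Take the first 64 bits of SHA-256 as the hash of the value.
        digest = hashlib.sha256(repr(value).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        rest_bits = 64 - self.index_bits
        index = h >> rest_bits                # top bits select a register
        rest = h & ((1 << rest_bits) - 1)     # remaining bits
        # 1-based position of the first 1 bit in the remaining bits.
        rank = rest_bits - rest.bit_length() + 1
        if rank > self.registers[index]:
            self.registers[index] = rank

    def merge(self, other):
        # Merging is a register-wise max, so it needs no raw values.
        if other.index_bits != self.index_bits:
            raise ValueError("sketches must use the same index_bits")
        merged = HllSketch(self.index_bits)
        merged.registers = bytearray(
            max(a, b) for a, b in zip(self.registers, other.registers)
        )
        return merged

    def estimate(self):
        m = len(self.registers)
        alpha = 0.7213 / (1 + 1.079 / m)
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            # Small-range correction (linear counting).
            return m * math.log(m / zeros)
        return raw


def sketch_of(values):
    sketch = HllSketch()
    for v in values:
        sketch.add(v)
    return sketch


# Two hypothetical data files whose column values overlap.
file_a = range(0, 6000)
file_b = range(4000, 10000)

# Exact per-file distinct counts cannot be combined: the sum double-counts.
summed_counts = len(set(file_a)) + len(set(file_b))   # 12000
true_distinct = len(set(file_a) | set(file_b))        # 10000

# Per-file sketches can be merged after the fact.
merged = sketch_of(file_a).merge(sketch_of(file_b))
print(summed_counts, true_distinct, round(merged.estimate()))
```

The merged sketch has exactly the same registers as one built over both files together, which is the property the per-file distinct counts lack. This toy version spends one byte per register, so each sketch is 2048 bytes per column at `index_bits=11`; real encodings differ, and the size is the open question here.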
Removed distinct counts in 75088f6875fc8d3cc4c3af38899742de1b919abf.
The Presto team has some code for HLL.
Format description: https://github.com/airlift/airlift/blob/master/stats/docs/hll.md
Code: https://github.com/airlift/airlift/tree/master/stats/src/main/java/io/airlift/stats/cardinality
I need to play with it, but the summaries can be pretty large.