Optimize DistinctCountHLL aggregation for high-cardinality dictionary-encoded columns
Problem
The DISTINCTCOUNTHLL aggregation function suffers from severe performance degradation when processing high-cardinality dictionary-encoded columns (~14 million distinct values in the profiled segment). Profiling shows that ~50% of CPU time is spent in RoaringBitmap operations.
For dictionary-encoded columns, the current implementation uses a RoaringBitmap to track dictionary IDs during aggregation. While memory-efficient for low cardinality, this approach has an insertion cost that grows to O(n log n) in total, which becomes prohibitively expensive for high-cardinality columns (>100K distinct values).
Queries on high-cardinality columns (1M-15M distinct values, e.g., user IDs or member IDs) take about 6-10 seconds. RoaringBitmap operations dominate query execution time, so HLL currently shows no performance benefit over an exact distinct count.
Proposed Solution
Implement adaptive cardinality handling that dynamically switches from RoaringBitmap to HyperLogLog (a rough sketch follows the list below):
- Low cardinality: use RoaringBitmap (memory-efficient, exact counts)
- High cardinality: convert to HyperLogLog (O(1) insertions)
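A minimal sketch of the idea, assuming the RoaringBitmap library and stream-lib's `HyperLogLog` (which Pinot's HLL aggregations are based on); the class name, threshold, and precision below are illustrative placeholders, not the proposed implementation:

```java
import org.roaringbitmap.RoaringBitmap;
import com.clearspring.analytics.stream.cardinality.HyperLogLog;

// Illustrative sketch only: starts exact with a RoaringBitmap and promotes to
// an approximate HyperLogLog once the observed cardinality crosses a cutoff.
public class AdaptiveDistinctCounter {
  private static final int PROMOTION_THRESHOLD = 100_000; // placeholder cutoff
  private static final int HLL_LOG2M = 12;                // placeholder HLL precision

  private RoaringBitmap _bitmap = new RoaringBitmap();
  private HyperLogLog _hll;
  private int _exactCount;

  public void add(int dictId) {
    if (_hll != null) {
      _hll.offer(dictId); // O(1) hash + register update
      return;
    }
    // checkedAdd returns true only for a newly-seen dict id, so _exactCount
    // tracks the exact cardinality without an O(n) getCardinality() call.
    if (_bitmap.checkedAdd(dictId) && ++_exactCount > PROMOTION_THRESHOLD) {
      promote();
    }
  }

  // One-time conversion: replay the exact dict ids into the HLL, then drop
  // the bitmap so subsequent inserts stay O(1). Note: a real implementation
  // would offer the column value rather than the segment-local dict id so
  // that HLLs from different segments merge correctly.
  private void promote() {
    _hll = new HyperLogLog(HLL_LOG2M);
    for (int dictId : _bitmap) {
      _hll.offer(dictId);
    }
    _bitmap = null;
  }

  public long cardinality() {
    return _hll != null ? _hll.cardinality() : _exactCount;
  }
}
```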
Tested with POC code that chooses HyperLogLog for the high-cardinality column and observed an improvement from ~8 sec to ~700 ms.
Wow.. we never expected RoaringBitmap to be slow.. Why does it have O(n log n) insertion complexity?
From the reading I did on this, it looks like RoaringBitmap:
- Looks up the right container for the high bits
- Does a binary search on the sorted short[] to find the position → O(log k); if the value is not present, shifts the tail of the array to insert it → O(k) in the worst case
- Per insertion that is O(log k + k_shift), and across millions of distinct inserts the total cost is closer to O(n log n) (plus copies); a simplified sketch of this insert path follows the list
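For concreteness, a simplified version of that insert path (not RoaringBitmap's actual source; real array containers compare shorts as unsigned and convert to a bitmap container past 4096 entries):

```java
import java.util.Arrays;

// Simplified illustration of an array-container insert: each miss pays a
// binary search plus a tail shift, so heavy distinct-insert workloads
// accumulate O(n log n) search cost plus the arraycopy shifts.
public class ArrayContainerSketch {
  private short[] _content = new short[8];
  private int _cardinality;

  public void add(short value) {
    int pos = Arrays.binarySearch(_content, 0, _cardinality, value); // O(log k)
    if (pos >= 0) {
      return; // already present
    }
    int insertAt = -pos - 1;
    if (_cardinality == _content.length) {
      _content = Arrays.copyOf(_content, _content.length * 2); // occasional regrow + copy
    }
    // Shift the tail one slot to the right to keep the array sorted: O(k) in
    // the worst case, which is the cost the profile attributes to Roaring.
    System.arraycopy(_content, insertAt, _content, insertAt + 1, _cardinality - insertAt);
    _content[insertAt] = value;
    _cardinality++;
  }
}
```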
Can you also try out BitSet, which guarantees O(1) complexity per insert (sketched after the list below)?
I feel the tradeoff is between:
- Memory: RoaringBitmap has better compaction with fewer inserts over a wider search space
- Performance: Besides the allocation, BitSet has more guaranteed performance
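For reference, a minimal sketch of the BitSet variant over dictionary ids; the class name and sizes are illustrative:

```java
import java.util.BitSet;

// Minimal sketch of the BitSet alternative: one bit per dictionary id, so
// set() is O(1) with no container lookup or tail shift, but the memory cost
// is dictionarySize / 8 bytes regardless of how many distinct values appear.
public class BitSetDistinctSketch {
  public static void main(String[] args) {
    int dictionarySize = 14_000_000;                // ~1.75 MB of bits for a 14M-entry dictionary
    BitSet seen = new BitSet(dictionarySize);

    int[] dictIdsForRows = {3, 7, 3, 1_000_000, 7}; // stand-in for per-row dict ids
    for (int dictId : dictIdsForRows) {
      seen.set(dictId);                             // O(1) per insert
    }
    System.out.println(seen.cardinality());         // exact distinct count -> 3
  }
}
```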
Currently RoaringBitmap is used to reduce dictionary lookups. For low cardinality, BitSet might always outperform RoaringBitmap. For high cardinality, when the same dictionary id repeats a lot, directly inserting into HLL might produce worse performance.
Thanks @Jackie-Jiang, let me consider that when testing this.
> For high cardinality, when the same dictionary id repeats a lot, directly inserting into HLL might produce worse performance
In the segment I tested (~25M rows, ~14M cardinality), HLL outperformed for the query I tested, which produced ~11 million distinct values. That said, I agree the picture flips in the low-cardinality case. If we had something like 25M rows and ~10K distinct values, then:
- BitSet should likely win on throughput thanks to O(1) set operations and predictable access patterns, while HLL may end up paying a similar per-row cost without gaining much.
Replacing RoaringBitmap with BitSet might solve the performance issue but could have memory implications. So I was thinking of introducing a switch threshold (similar in spirit to "smartHLL"): start with an exact structure (e.g., BitSet/Roaring), and only promote to HLL once the observed distinct count / density indicates it's worth it (roughly sketched below). I'll test a range of cardinalities and repetition patterns to see if there's a "sweet spot" for that cutoff.
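A rough sketch of what that promotion check could look like; both cutoffs are placeholders for whatever the benchmark sweep ends up choosing:

```java
// Hypothetical "smartHLL"-style promotion predicate; all constants are
// placeholders to be tuned by the planned benchmarks, not measured values.
public class SmartHllSwitchSketch {
  private static final long DISTINCT_CUTOFF = 100_000L; // size of the exact structure
  private static final long HLL_BYTES = 1L << 14;       // e.g. 16 KB of HLL registers

  // Promote only when the exact structure is both large in absolute terms and
  // more expensive than a fixed-size HLL would be for this dictionary.
  public static boolean shouldPromoteToHll(long observedDistinct, long dictionarySize) {
    long bitSetBytes = dictionarySize / 8;              // upfront cost of an exact BitSet
    return observedDistinct > DISTINCT_CUTOFF && bitSetBytes > HLL_BYTES;
  }
}
```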