databend icon indicating copy to clipboard operation
databend copied to clipboard

BUG: feed estimated number of distinct values to bloom filter

Open dantengsky opened this issue 2 years ago • 1 comments

Summary

currently, the "number of distinct values" we feed to the bloom filter is the number of rows, which is too conservative (naive:). we should use something like hyperloglog to estimate the NDV of give column and then feed the NDV to the bloom filter, for columns that have low cardinality, this will "shrink" the bloom filter index significantly.

https://github.com/datafuselabs/databend/blob/f59efc05bc080e762846fbe8bac24c806e432a75/src/query/storages/index/src/bloom_filter.rs#L151-L162

may also be related to https://github.com/datafuselabs/databend/issues/7314

dantengsky avatar Sep 23 '22 05:09 dantengsky

This may make the bloom filter bitset smaller. ClickHouse also has a bloom filter index, it seems to work like databend, I haven't checked his bitset size yet. cc @drmingdrmer

BohuTANG avatar Sep 23 '22 05:09 BohuTANG

vanilla bloom filter is replaced by xor filter in PR #7870

dantengsky avatar Sep 27 '22 02:09 dantengsky