
N-gram Bloom Filter Support

Open aadant opened this issue 8 months ago • 2 comments

Describe the enhancement requested

Some databases support Bloom filters built on the n-grams of a string. This helps operations such as "like" matching: searching for a distinctive token that contains rare n-grams can be sped up greatly.

https://clickhouse.com/docs/optimize/skipping-indexes#bloom-filter-types

See ngrambf_v1.
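
To make the idea concrete, here is a minimal Python sketch of an n-gram Bloom filter used to skip data for substring searches. It is an illustration only: it is not ClickHouse's ngrambf_v1 implementation and not an existing Parquet feature. The names `ngrams`, `NgramBloomFilter`, and `might_contain`, the parameter defaults, and the choice of BLAKE2b as the hash are all assumptions made for this example.

```python
import hashlib


def ngrams(text, n=3):
    """Return the set of all character n-grams of `text`."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


class NgramBloomFilter:
    """Hypothetical n-gram Bloom filter covering one block of string values."""

    def __init__(self, n=3, num_bits=4096, num_hashes=3):
        self.n = n
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, gram):
        # Derive num_hashes bit positions by salting BLAKE2b with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                gram.encode("utf-8"),
                digest_size=8,
                salt=seed.to_bytes(8, "little"),
            ).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, value):
        """Record every n-gram of a stored string value."""
        for gram in ngrams(value, self.n):
            for pos in self._positions(gram):
                self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, needle):
        """False means no stored value can contain `needle` as a substring."""
        if len(needle) < self.n:
            # Too short to yield an n-gram, so nothing can be ruled out.
            return True
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for gram in ngrams(needle, self.n)
            for pos in self._positions(gram)
        )


bf = NgramBloomFilter()
bf.add("parquet bloom filter")
print(bf.might_contain("bloom"))   # every n-gram of "bloom" was added
print(bf.might_contain("zzzqqq"))  # rare n-grams let the block be skipped
```

A reader evaluating a "like" predicate would consult the filter first and scan the block only when `might_contain` returns True. A False answer is definitive, while a True answer may be a false positive, so the predicate still has to be checked against the actual values.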

aadant avatar Mar 31 '25 03:03 aadant

This one is language-independent, @wgtmac.

aadant avatar Mar 31 '25 18:03 aadant

  • See also https://github.com/apache/parquet-format/issues/489#issuecomment-2833438755

alamb avatar Apr 27 '25 12:04 alamb

BTW, it is possible today to add user-defined indexes for this use case.

You can put such indices in Parquet files, as described here:

  • https://datafusion.apache.org/blog/2025/07/14/user-defined-parquet-indexes/

Or you can store such indices externally, as described here:

  • https://datafusion.apache.org/blog/2025/08/15/external-parquet-indexes/

alamb avatar Aug 18 '25 10:08 alamb