parquet-format
N-gram Bloom Filter Support
Describe the enhancement requested
Some databases support Bloom filters built on the n-grams of a string. This helps with operations such as "like" matching: searching for a particular discriminating token whose n-grams are rare can be greatly sped up.
https://clickhouse.com/docs/optimize/skipping-indexes#bloom-filter-types
see ngrambf_v1
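To make the idea concrete, here is a minimal sketch of an n-gram Bloom filter in Python. It is only an illustration under assumptions of mine, not how ClickHouse's ngrambf_v1 or Parquet's existing Bloom filters are implemented: the class name `NgramBloomFilter`, the parameters `n`, `num_bits` and `num_hashes`, and the SHA-256 double-hashing scheme are all hypothetical choices.

```python
import hashlib


class NgramBloomFilter:
    """Illustrative Bloom filter over the character n-grams of strings.

    Hypothetical sketch: names, parameters and hashing are assumptions,
    not taken from ClickHouse or the Parquet format.
    """

    def __init__(self, n=3, num_bits=4096, num_hashes=3):
        self.n = n
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _ngrams(self, text):
        # All contiguous substrings of length n; empty if text is shorter than n.
        return {text[i:i + self.n] for i in range(len(text) - self.n + 1)}

    def _positions(self, gram):
        # Derive num_hashes bit positions from one digest (double hashing).
        digest = hashlib.sha256(gram.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:16], "little") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, text):
        # Set the bits for every n-gram of a stored value.
        for gram in self._ngrams(text):
            for pos in self._positions(gram):
                self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, needle):
        # True means "some stored value may contain needle as a substring".
        # False means no stored value can contain it, so the block can be skipped.
        grams = self._ngrams(needle)
        if not grams:
            return True  # needle shorter than n: the filter cannot rule anything out
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for gram in grams
            for pos in self._positions(gram)
        )


if __name__ == "__main__":
    # One filter per block of values (for example, per row group or page).
    block_filter = NgramBloomFilter()
    for value in ["disk error on node 7", "request completed"]:
        block_filter.add(value)
    # A reader evaluating LIKE '%error%' would only scan this block if this is True.
    print(block_filter.might_contain("error"))
```

A filter like this can return false positives but never false negatives, so it can only be used to skip data, and a pattern shorter than `n` cannot be pruned at all.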
This one is language independent @wgtmac
- See also https://github.com/apache/parquet-format/issues/489#issuecomment-2833438755
BTW, it is already possible today to add user-defined indexes for this use case.
You can put such indices in Parquet files, as described here:
- https://datafusion.apache.org/blog/2025/07/14/user-defined-parquet-indexes/
Or you can store such indices externally, as described here:
- https://datafusion.apache.org/blog/2025/08/15/external-parquet-indexes/