parquet-format
Token Bloom filter Support
Describe the enhancement requested
Databases like ClickHouse support Bloom filters built on the tokens present in a string rather than on the string itself.
https://clickhouse.com/docs/optimize/skipping-indexes#bloom-filter-types
I suggest that Apache Parquet support this type of Bloom filter to speed up token matching and SQL LIKE operations.
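To make the request concrete, here is a minimal sketch of what a token Bloom filter does. It is not ClickHouse's implementation and not Parquet's existing split-block Bloom filter; the `TokenBloomFilter` class, its sizes, and the rule that a token is a run of ASCII letters and digits are all assumptions chosen for illustration.

```python
import hashlib
import re


class TokenBloomFilter:
    """Toy Bloom filter over the tokens of each inserted string (illustrative only)."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    @staticmethod
    def tokenize(text: str) -> list[str]:
        # Assumption: a token is a maximal run of ASCII letters and digits.
        return re.findall(r"[A-Za-z0-9]+", text)

    def _positions(self, token: str):
        # Derive num_hashes bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                token.encode("utf-8"), digest_size=8, salt=bytes([seed])
            ).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, text: str) -> None:
        # Insert every token of the string, not the string as a whole.
        for token in self.tokenize(text):
            for pos in self._positions(token):
                self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain_token(self, token: str) -> bool:
        # False means the token is definitely absent; True means "maybe present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(token)
        )


bf = TokenBloomFilter()
bf.add("GET /api/v1/users 200 12ms")
bf.add("POST /api/v1/login 401 8ms")
print(bf.might_contain_token("login"))    # True: Bloom filters have no false negatives
print(bf.might_contain_token("timeout"))  # expected False, though a false positive is possible
```

A reader could use such a filter to skip a row group when a query like `WHERE msg LIKE '% login %'` asks for a token the filter rules out, which a Bloom filter over whole strings cannot do.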
FYI: there was a discussion on the dev@parquet ML https://lists.apache.org/thread/jhhxlsq963mx3qs87rtknn0vwdnp79fh
@wgtmac the pluggable tokenizer is indeed a valid concern, but this type of filter is mostly used for machine-generated text made of alphanumeric tokens separated by spaces.
Do you want to raise this on the dev@parquet mailing list? I'm afraid there isn't a large enough audience here. @aadant
In my opinion, adding new index-like structures to the parquet spec makes sense when a "large" number of engines will support writing and using them.
Today it is possible to use such index structures without changing the spec in at least two ways:
- Store the index outside of the parquet files themselves (e.g. in a metadata store). Here is an example of using an external index in Apache DataFusion
- Store the index in the user-defined metadata (e.g. key/value metadata)
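The two options can be sketched with the standard library alone. Everything here is hypothetical: the file names, the `example.token_bloom` key, the 256-bit single-hash bitmap, and the plain dicts standing in for a metadata store and for a Parquet footer's key/value metadata. The index is base64-encoded in the second option because key/value metadata values are strings.

```python
import base64
import hashlib

NUM_BITS = 256                        # hypothetical index size
METADATA_KEY = "example.token_bloom"  # hypothetical key, not part of any spec


def bit_position(token: str) -> int:
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "little") % NUM_BITS


def build_index(rows: list[str]) -> bytes:
    """Build a single-hash bitmap over the space-separated tokens of all rows."""
    bits = bytearray(NUM_BITS // 8)
    for row in rows:
        for token in row.split():
            pos = bit_position(token)
            bits[pos // 8] |= 1 << (pos % 8)
    return bytes(bits)


def might_contain(index: bytes, token: str) -> bool:
    pos = bit_position(token)
    return bool(index[pos // 8] & (1 << (pos % 8)))


# Stand-in for the string column of two Parquet files.
files = {
    "part-0.parquet": ["error disk full", "warn disk slow"],
    "part-1.parquet": ["info user login"],
}

# Option 1: keep the index outside the files, keyed by path.
external_index = {path: build_index(rows) for path, rows in files.items()}

# Option 2: carry the same bytes as a string in each file's key/value metadata.
key_value_metadata = {
    path: {METADATA_KEY: base64.b64encode(index).decode("ascii")}
    for path, index in external_index.items()
}


def files_to_scan(token: str) -> list[str]:
    """Return the files whose index cannot rule out the token."""
    return [p for p, idx in external_index.items() if might_contain(idx, token)]


print(files_to_scan("login"))  # always includes part-1.parquet
```

In both cases the Parquet spec is untouched; only the engines that know about the index use it, and every other reader ignores it.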
My suggestion is to postpone any changes to the parquet spec until there are several engines that use this type of index with parquet already.
I believe @emkornfield expresses similar sentiments in his response on the mailing list: https://lists.apache.org/thread/r2xfqk9kx974hhh23zr06jy80dvlhnmd
> Today it is possible to use such index structures without changing the spec in at least two ways:
I wrote some blog posts about how this process works.
You can put such indices in Parquet files, as described here:
- https://datafusion.apache.org/blog/2025/07/14/user-defined-parquet-indexes/
Or you can store such indices externally, as described here:
- https://datafusion.apache.org/blog/2025/08/15/external-parquet-indexes/