parquet-format Token Bloom filter Support

Describe the enhancement requested

Databases like ClickHouse support Bloom filters on the tokens present in a string rather than on the string itself.

https://clickhouse.com/docs/optimize/skipping-indexes#bloom-filter-types

I suggest that Apache Parquet support this type of Bloom filter to speed up token matching or SQL LIKE operations.
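To make the request concrete, here is a minimal sketch of a token Bloom filter. It is not ClickHouse's implementation or any Parquet API: the `TokenBloomFilter` class, the `tokenize` helper, the filter size, the number of hashes, and the choice of SHA-256 are all assumptions made for illustration. The tokenizer is assumed to split on non-alphanumeric characters.

```python
import hashlib
import re


def tokenize(text):
    # Assumed tokenizer: runs of ASCII letters and digits, case-sensitive.
    return re.findall(r"[A-Za-z0-9]+", text)


class TokenBloomFilter:
    """Toy Bloom filter over the tokens of strings (illustration only)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits      # assumed size, not tuned
        self.num_hashes = num_hashes  # assumed number of probes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, token):
        # Derive several bit positions by prefixing the token with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{token}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add_text(self, text):
        # Index every token of the string, not the string as a whole.
        for token in tokenize(text):
            for pos in self._positions(token):
                self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain_token(self, token):
        # False means "definitely absent"; True means "possibly present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(token)
        )


bf = TokenBloomFilter()
for value in ["GET /api/v1/users 200", "connection timeout host42"]:
    bf.add_text(value)

# Bloom filters have no false negatives, so an indexed token is always found.
print(bf.might_contain_token("timeout"))  # True
```

A reader could use one such filter per row group or page: if `might_contain_token` returns `False` for a token required by the predicate, that unit can be skipped. A `True` result may be a false positive, so the data still has to be checked.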

Mar 31 '25 03:03 aadant

FYI: there was a discussion on the dev@parquet ML https://lists.apache.org/thread/jhhxlsq963mx3qs87rtknn0vwdnp79fh

Mar 31 '25 07:03 wgtmac

@wgtmac the pluggable tokenizer is indeed a valid concern, but this type of filter is mostly used for machine-generated alphanumeric tokens separated by spaces.

Mar 31 '25 18:03 aadant

Do you want to raise this on the dev@parquet mailing list? I'm afraid there isn't enough of an audience here. @aadant

Apr 03 '25 14:04 wgtmac

In my opinion, adding new index-like structures to the parquet spec makes sense when a "large" number of engines will support writing and using them.

Today it is possible to use such index structures without changing the spec in at least two ways:

Store the index outside of the Parquet files themselves (e.g. in a metadata store). Here is an example of using an external index in Apache DataFusion.
Store the index in the user-defined metadata (e.g. key/value metadata).
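As a rough illustration of the second option, the sketch below serializes a token Bloom filter to a string and consults it before scanning a file. It does not call any Parquet library: a plain dict stands in for each file's key/value metadata, and the key name `example.token_bloom.message`, the filter parameters, and the helper functions are all hypothetical. The bitmap is base64-encoded because key/value metadata values are strings.

```python
import base64
import hashlib
import re

NUM_BITS = 1024    # assumed filter size
NUM_HASHES = 3     # assumed number of hash probes
INDEX_KEY = "example.token_bloom.message"  # hypothetical metadata key


def positions(token):
    # Derive several bit positions by prefixing the token with a seed.
    for seed in range(NUM_HASHES):
        digest = hashlib.sha256(f"{seed}:{token}".encode("utf-8")).digest()
        yield int.from_bytes(digest[:8], "big") % NUM_BITS


def build_index(values):
    # Writer side: index every alphanumeric token of every value.
    bits = bytearray(NUM_BITS // 8)
    for value in values:
        for token in re.findall(r"[A-Za-z0-9]+", value):
            for pos in positions(token):
                bits[pos // 8] |= 1 << (pos % 8)
    return base64.b64encode(bytes(bits)).decode("ascii")


def may_contain(metadata, token):
    # Reader side: decide whether a file can be skipped for this token.
    encoded = metadata.get(INDEX_KEY)
    if encoded is None:
        return True  # no index present, so the file cannot be skipped
    bits = base64.b64decode(encoded)
    return all(bits[pos // 8] & (1 << (pos % 8)) for pos in positions(token))


# Each dict stands in for one file's key/value metadata.
files = {
    "a.parquet": {INDEX_KEY: build_index(["connection timeout host42"])},
    "b.parquet": {INDEX_KEY: build_index([])},
    "c.parquet": {},  # written by an engine that does not know the index
}

to_scan = [name for name, meta in files.items() if may_contain(meta, "timeout")]
print(to_scan)  # ['a.parquet', 'c.parquet']
```

Engines that do not recognize the key simply ignore it, which is why this approach needs no change to the spec. Files without the index fall back to a normal scan.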

My suggestion is to postpone any changes to the parquet spec until there are several engines that use this type of index with parquet already.

Apr 27 '25 12:04 alamb

I believe @emkornfield expresses similar sentiments in his response on the mailing list: https://lists.apache.org/thread/r2xfqk9kx974hhh23zr06jy80dvlhnmd

May 01 '25 12:05 alamb

Today it is possible to use such index structures without changing the spec in at least two ways:

I wrote some blog posts about how this works. You can find them here:

You can put such indices in Parquet files, as described here:

https://datafusion.apache.org/blog/2025/07/14/user-defined-parquet-indexes/

Or you can store such indices externally, as described here:

https://datafusion.apache.org/blog/2025/08/15/external-parquet-indexes/

Aug 18 '25 10:08 alamb