Token Bloom filter Support

Open aadant opened this issue 8 months ago • 5 comments

Describe the enhancement requested

Databases like ClickHouse support Bloom filters on the tokens present in a String rather than on the String itself.

https://clickhouse.com/docs/optimize/skipping-indexes#bloom-filter-types

I suggest that Apache Parquet support this type of Bloom filter to speed up token matching or SQL LIKE operations.
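To make the idea concrete, here is a minimal sketch of a token-level Bloom filter in plain Python. It is not ClickHouse's implementation and not a proposed Parquet format: the class name `TokenBloomFilter`, the filter size, the number of hashes, the SHA-256 based hashing, and the rule "a token is a run of ASCII letters and digits" are all assumptions chosen for illustration.

```python
import hashlib
import re


class TokenBloomFilter:
    """Bloom filter over the tokens of strings, not the whole strings."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits      # assumed size, not from any spec
        self.num_hashes = num_hashes  # assumed number of hash functions
        self.bits = 0                 # a Python int used as a bit array

    @staticmethod
    def tokenize(text: str) -> list[str]:
        # Assumption: tokens are maximal runs of ASCII letters and digits.
        return re.findall(r"[A-Za-z0-9]+", text)

    def _positions(self, token: str):
        # Derive num_hashes bit positions by salting a SHA-256 digest.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{token}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, text: str) -> None:
        # Insert every token of the string, not the string itself.
        for token in self.tokenize(text):
            for pos in self._positions(token):
                self.bits |= 1 << pos

    def might_contain_token(self, token: str) -> bool:
        # False means the token is definitely absent; True means "maybe".
        return all((self.bits >> pos) & 1 for pos in self._positions(token))


bloom = TokenBloomFilter()
for line in ["GET /index.html 200", "POST /login 500 timeout"]:
    bloom.add(line)

# A token that was inserted is always reported as possibly present.
print(bloom.might_contain_token("timeout"))
```

A reader could use such a filter to skip a row group or file when a searched token is reported as definitely absent. A regular Bloom filter on the whole String cannot answer that question, because it only supports exact matches on the full value.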

aadant avatar Mar 31 '25 03:03 aadant

FYI: there was a discussion on the dev@parquet ML https://lists.apache.org/thread/jhhxlsq963mx3qs87rtknn0vwdnp79fh

wgtmac avatar Mar 31 '25 07:03 wgtmac

@wgtmac the pluggable tokenizer is indeed a valid concern, but this type of filter is mostly used for alphanumeric tokens separated by spaces, typically in machine-generated text.

aadant avatar Mar 31 '25 18:03 aadant

Do you want to raise this on the dev@parquet mailing list? I'm afraid that there isn't enough of an audience here. @aadant

wgtmac avatar Apr 03 '25 14:04 wgtmac

In my opinion, adding new index-like structures to the parquet spec makes sense when a "large" number of engines will support writing and using them.

Today it is possible to use such index structures without changing the spec in at least two ways:

  1. Store the index outside of the Parquet files themselves (e.g. in a metadata store). Here is an example of using an external index in Apache DataFusion
  2. Store the index in the user-defined metadata (e.g. key/value metadata)
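As a rough illustration of both options, the sketch below builds a token filter per file, serializes it to a base64 string, and uses it to prune files before reading them. Everything here is an assumption made for the example: the file names, the JSON document standing in for a metadata store, the key name mentioned in the comment, and the filter parameters. It does not use any Parquet library.

```python
import base64
import hashlib
import json
import re

NUM_BITS = 1024   # assumed filter size, not from any spec
NUM_HASHES = 3    # assumed number of hash functions


def positions(token: str) -> list[int]:
    # Bit positions for a token, derived from salted SHA-256 digests.
    out = []
    for seed in range(NUM_HASHES):
        digest = hashlib.sha256(f"{seed}:{token}".encode("utf-8")).digest()
        out.append(int.from_bytes(digest[:8], "big") % NUM_BITS)
    return out


def build_filter(values: list[str]) -> str:
    """Build a token Bloom filter and return it as a base64 string."""
    bits = 0
    for value in values:
        for token in re.findall(r"[A-Za-z0-9]+", value):
            for pos in positions(token):
                bits |= 1 << pos
    return base64.b64encode(bits.to_bytes(NUM_BITS // 8, "big")).decode("ascii")


def might_contain(encoded: str, token: str) -> bool:
    # Decode the stored string and test every bit position for the token.
    bits = int.from_bytes(base64.b64decode(encoded), "big")
    return all((bits >> pos) & 1 for pos in positions(token))


# Option 1: an external "metadata store", here just a JSON document
# mapping hypothetical file names to their serialized filters.
store = json.dumps({
    "part-0.parquet": build_filter(["GET /index.html 200"]),
    "part-1.parquet": build_filter(["POST /login 500 timeout"]),
})

# Option 2 would place the same string under a hypothetical key such as
# "token_bloom.message" in the file's key/value metadata instead.

index = json.loads(store)
candidates = [name for name, enc in index.items() if might_contain(enc, "timeout")]
print(candidates)  # files that may contain the token and must still be scanned
```

Because a Bloom filter can return false positives, the candidate list may include files that do not contain the token, but it never drops a file that does.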

My suggestion is to postpone any changes to the parquet spec until there are several engines that use this type of index with parquet already.

alamb avatar Apr 27 '25 12:04 alamb

I believe @emkornfield expresses similar sentiments in his response on the mailing list: https://lists.apache.org/thread/r2xfqk9kx974hhh23zr06jy80dvlhnmd

alamb avatar May 01 '25 12:05 alamb

Today it is possible to use such index structures without changing the spec in at least two ways:

I wrote some blog posts about how this process works.

You can put such indices in Parquet files, as described here:

  • https://datafusion.apache.org/blog/2025/07/14/user-defined-parquet-indexes/

Or you can store such indices externally, as described here:

  • https://datafusion.apache.org/blog/2025/08/15/external-parquet-indexes/

alamb avatar Aug 18 '25 10:08 alamb