Skip row_groups in parquet files using bloom_filters
Problem description
The parquet2 library has bloom filters implemented: https://github.com/jorgecarleitao/parquet2/tree/main/src/bloom_filter.
But as far as I can see, pruning row_groups by bloom filters isn't implemented in polars. This could provide a significant performance boost when filtering on binary or string columns.
@ritchie46 I can implement this if it makes sense. Don't know if it is already implemented in some way.
We already use min-max pruning based on statistics. That might be a starting point for your implementation. You would probably need to convert a polars expression into something the bloom filter implementation can accept.
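For context, a minimal sketch of what that min/max pruning looks like at the footer level, and where a bloom-filter check would slot in. This uses pyarrow purely to read the parquet metadata; the helper name and column are made up for illustration, and none of this is the actual Polars code path:

```python
import pyarrow.parquet as pq

def row_groups_to_read(path: str, column: str, value: str) -> list[int]:
    """Return indices of row groups that *may* contain `value` in `column`."""
    meta = pq.ParquetFile(path).metadata
    # Find the column chunk index for `column` (same order in every row group).
    first = meta.row_group(0)
    col_idx = next(
        i for i in range(first.num_columns)
        if first.column(i).path_in_schema == column
    )
    keep = []
    for rg in range(meta.num_row_groups):
        stats = meta.row_group(rg).column(col_idx).statistics
        if stats is None or not stats.has_min_max:
            keep.append(rg)  # no statistics -> cannot prune, must read
            continue
        if stats.min <= value <= stats.max:
            # min/max cannot rule the value out. For high-cardinality strings
            # (e.g. hashes) nearly every row group passes this test, which is
            # exactly where a bloom-filter membership check would prune much
            # harder before any data pages are read.
            keep.append(rg)
    return keep
```

The expression-conversion part would then boil down to extracting `col == literal` (and `is_in`) predicates from the polars expression and feeding the literals into checks like the one above.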
I just arrived here because I'm trying to read 10GB out of ~100TB of data with polars (or pyspark, open to either). I don't know which partitions contain the data, but I do know the 10GB sit mostly in the same ~0.1% of all partitions. All rows have an md5_hash column, which could be used together with bloom filters to read almost only the data we actually need. I.e.:
- we write the data with bloom_filters enabled, based on the MD5
- polars reads 2 datasets, the IDs df and the full large df in lazy mode
- through a join or some other way of telling polars which row IDs we want, read only <110% of the target data volume (i.e. skip the partitions that the bloom filters in the parquet files tell us don't contain any of the IDs we care about)
In my mind, what polars would have to do for me is grab all the parquet files' footers and check all the IDs against the bloom filters stored there, then skip every partition that we know doesn't contain any of the data.
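Roughly the workflow I have in mind, as a sketch (paths and the column name are placeholders; today this still ends up reading every partition, because nothing prunes on the bloom filters):

```python
import polars as pl

# The wanted rows are identified by their md5_hash values.
ids = pl.scan_parquet("ids/*.parquet").select("md5_hash")

# The full dataset, also carrying an md5_hash column.
big = pl.scan_parquet("data/**/*.parquet")

# A semi join keeps only the rows whose md5_hash appears in `ids`.
# With bloom-filter pruning, the reader could skip every row group (and
# hence most files) whose filter proves that none of the ids are present.
result = big.join(ids, on="md5_hash", how="semi").collect()
```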
Hey, not sure about your specific application, but I would recommend just building external indices for it and not using a query engine for parquet. Engines are very bad at skipping data, and loading the footers is very slow anyway if you have a significant number of files.
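For example, the external index can be as small as a per-file min/max of md5_hash (or a per-file bloom filter), built once. A sketch, with made-up paths and layout:

```python
import glob
import polars as pl

# One-off pass: record the per-file min/max of md5_hash in a tiny sidecar table.
# If the files are written sorted or bucketed by hash, these ranges are narrow
# and prune very well; a per-file bloom filter would be the sharper alternative.
files = glob.glob("data/**/*.parquet", recursive=True)
index = pl.concat([
    pl.scan_parquet(f).select(
        pl.lit(f).alias("file"),
        pl.col("md5_hash").min().alias("lo"),
        pl.col("md5_hash").max().alias("hi"),
    )
    for f in files
]).collect()
index.write_parquet("md5_index.parquet")
```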
mhhhm @ozgrakkurt good point, we're using kedro.org for our pipelining and that sometimes means giving away too much of the control over file reading. But I guess we could wrap a custom dataset that selects the files to read first and then reads only those. thx
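The wrapper itself could stay very small. A rough sketch that reuses the md5_index.parquet sidecar from the previous comment (all names are made up, and the range check is deliberately coarse):

```python
import polars as pl

def scan_matching_files(ids: pl.Series) -> pl.LazyFrame:
    """Select only files whose [lo, hi] hash range overlaps the wanted ids,
    then hand just that file list to polars."""
    index = pl.read_parquet("md5_index.parquet")
    wanted = index.filter(
        (pl.col("lo") <= ids.max()) & (pl.col("hi") >= ids.min())
    )
    # A per-id lookup or a per-file bloom filter would prune more aggressively;
    # this only uses the coarse overall min/max range of the ids.
    return pl.scan_parquet(wanted["file"].to_list())

lf = scan_matching_files(pl.Series("md5_hash", ["<hash 1>", "<hash 2>"]))
```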
Hi - did this ever get implemented? It seems like a really useful way to filter large (multi-file) datasets where there is a wide variety of string values. I can see that some of the Java libraries have done it. It would be a great feature to have for LazyFrame.
The Rust parquet libraries provide everything you need to write the bloom filters. Reading and filtering feels a bit more "roll your own" though - another reason why it would be great to bundle this into Polars.