Skip row_groups in parquet files using bloom_filters

Open ozgrakkurt opened this issue 3 years ago • 4 comments

Problem description

The parquet2 library has bloom filters implemented: https://github.com/jorgecarleitao/parquet2/tree/main/src/bloom_filter.

But as far as I can see, pruning row groups by bloom filters isn't implemented in Polars. This could provide a significant performance boost when filtering on binary or string columns.

@ritchie46 I can implement this if it makes sense. I don't know if it is already implemented in some way.
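
For context on what the skip would compute: a row group can be skipped for an equality predicate only when the column's bloom filter reports the literal as definitely absent. The sketch below is not the parquet2 API; it is a spec-level illustration of the Parquet split-block bloom filter (SBBF) membership check, assuming the xxhash-rust crate for the required xxHash64 of the plain-encoded value.

```rust
// Spec-level sketch of the Parquet split-block bloom filter (SBBF) probe.
// Assumes the `xxhash_rust` crate (feature "xxh64"); this is not the
// parquet2 API itself, just the check it implements.
use xxhash_rust::xxh64::xxh64;

/// Salt constants defined by the Parquet bloom filter specification.
const SALT: [u32; 8] = [
    0x47b6137b, 0x44974d91, 0x8824ad5b, 0xa2b7289d,
    0x705495c7, 0x2df1424b, 0x9efc4947, 0x5c6bfb31,
];

/// One 256-bit block = eight 32-bit words.
type Block = [u32; 8];

/// Derive the eight probe bits for the lower 32 bits of the hash,
/// one bit per word in a block.
fn block_mask(hash: u32) -> Block {
    let mut mask = [0u32; 8];
    for i in 0..8 {
        // Multiply by the salt and keep the top 5 bits to pick a bit position.
        mask[i] = 1 << (hash.wrapping_mul(SALT[i]) >> 27);
    }
    mask
}

/// Returns false only if the value is definitely not in the row group;
/// true means "maybe present", so the row group cannot be skipped.
fn sbbf_maybe_contains(bitset: &[Block], plain_encoded_value: &[u8]) -> bool {
    let hash = xxh64(plain_encoded_value, 0);
    // The upper 32 bits select the block, the lower 32 bits select the bits.
    let block_index = (((hash >> 32) * bitset.len() as u64) >> 32) as usize;
    let mask = block_mask(hash as u32);
    let block = &bitset[block_index];
    (0..8).all(|i| block[i] & mask[i] != 0)
}
```

For a string or binary column the plain-encoded value is just its raw bytes, so a predicate like `col("id") == "abc"` can be probed directly with the literal's bytes.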

ozgrakkurt avatar Oct 25 '22 14:10 ozgrakkurt

We already use min-max pruning based on statistics. That might be a starting point for your implementation. You would probably need to convert a Polars expression into something that is acceptable to the bloom filter implementation.
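
As a hedged sketch of that conversion, the snippet below walks a simplified, hypothetical predicate and turns equality-against-literal comparisons into bloom filter probes, analogous to how min-max statistics pruning evaluates a predicate against per-row-group metadata. The `Predicate` type and the `maybe_contains` probe are illustrative names, not Polars internals.

```rust
// Hypothetical sketch: none of these types are Polars internals. The
// `maybe_contains` probe stands in for a per-column bloom filter check,
// e.g. the `sbbf_maybe_contains` sketch earlier in this thread.

/// A tiny, illustrative subset of a physical predicate.
enum Predicate {
    /// column == literal (literal already plain-encoded to bytes)
    Eq { column: String, literal: Vec<u8> },
    And(Box<Predicate>, Box<Predicate>),
    Or(Box<Predicate>, Box<Predicate>),
    /// Anything the bloom filter cannot reason about.
    Other,
}

/// Decide whether a row group can be skipped, given a way to probe the bloom
/// filter of a named column in that row group. `maybe_contains` returns None
/// when no filter was written for that column.
/// The row group is skipped only when the predicate provably matches no rows.
fn can_skip_row_group(
    pred: &Predicate,
    maybe_contains: &dyn Fn(&str, &[u8]) -> Option<bool>,
) -> bool {
    match pred {
        // Skip only if a filter exists and says "definitely absent".
        Predicate::Eq { column, literal } => maybe_contains(column, literal) == Some(false),
        // AND: skippable if either side alone rules the row group out.
        Predicate::And(l, r) => {
            can_skip_row_group(l, maybe_contains) || can_skip_row_group(r, maybe_contains)
        }
        // OR: skippable only if both sides rule it out.
        Predicate::Or(l, r) => {
            can_skip_row_group(l, maybe_contains) && can_skip_row_group(r, maybe_contains)
        }
        // Be conservative for everything else.
        Predicate::Other => false,
    }
}
```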

ritchie46 avatar Oct 25 '22 14:10 ritchie46

I just arrived here because I'm trying to read 10 GB from ~100 TB of data with Polars (or PySpark, open to either), where I don't know which partitions contain the data, but I do know the 10 GB is mostly concentrated in the same ~0.1% of the partitions. All rows have an md5_hash column, which together with bloom filters could be used to read almost only the data we actually need. I.e.:

  1. we write the data with bloom filters enabled, based on the md5_hash column
  2. Polars reads 2 datasets in lazy mode: the IDs df and the full large df
  3. through a join or some other way of telling Polars which row IDs we want, we read only <110% of the target data volume (i.e. we avoid reading the partitions that, thanks to the bloom filters in the parquet files, we know do not contain any of the IDs we care about)

In my mind, what Polars would have to do for me is grab all the parquet files' footers and check all IDs against the bloom filters stored there, then skip any partition that provably contains none of our data.
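
A minimal sketch of that pre-scan, assuming the bloom filter probes for the md5_hash column have already been loaded from each file's footer (how you load them depends on your parquet library; the function and parameter names here are placeholders):

```rust
// Sketch of the pre-scan described above: keep only the files whose bloom
// filters might contain at least one of the wanted IDs. Loading the filters
// out of the footers is left abstract, since it depends on the parquet
// library in use.
use std::collections::HashMap;

/// `filters_per_file` maps a file path to one probe per row group; each probe
/// answers "might this row group contain this plain-encoded value?".
/// A file with no filters at all must be kept: we cannot prove it empty.
fn files_to_read(
    filters_per_file: &HashMap<String, Vec<Box<dyn Fn(&[u8]) -> bool>>>,
    wanted_md5s: &[&str],
) -> Vec<String> {
    filters_per_file
        .iter()
        .filter(|(_, row_group_probes)| {
            row_group_probes.is_empty()
                || row_group_probes.iter().any(|maybe_contains| {
                    wanted_md5s.iter().any(|id| maybe_contains(id.as_bytes()))
                })
        })
        .map(|(path, _)| path.clone())
        .collect()
}
```

Note that a missing bloom filter never lets you skip a file: the absence of a filter only means "unknown", so the file has to be kept.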

pascalwhoop avatar Sep 14 '24 16:09 pascalwhoop

Hey, not sure about your specific application, but I would recommend just building external indices for it and not using a query engine for parquet. Engines are very bad at skipping data, and loading the file metadata is very slow anyway if you have a significant number of files.

ozgrakkurt avatar Sep 14 '24 17:09 ozgrakkurt

Mhm, good point @ozgrakkurt. We're using kedro.org for our pipelining, and that sometimes means "giving away" too much control over file reading. But I guess we could wrap up a custom dataset that selects the files to read first and then reads only those. Thx

pascalwhoop avatar Oct 14 '24 12:10 pascalwhoop

Hi - did this ever get implemented? It seems like a really useful way to filter large (multi-file) datasets where there is a high cardinality of strings. I can see some of the Java libraries have done it. It would be a great feature to have for LazyFrame.

The Rust parquet libraries provide everything you need to write the bloom filters. Reading and filtering feels a bit more "roll your own" though - another reason why it would be great to bundle this into Polars.
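
For the write side, here is a minimal sketch with the arrow-rs `parquet` crate, which can write bloom filters today. The bloom-filter property names are my understanding of its `WriterProperties` builder and should be treated as assumptions to verify against your crate version; the column name, NDV, and FPP values are placeholders.

```rust
// Sketch: write a parquet file with a bloom filter on a string column using
// the arrow-rs `parquet` crate. The bloom-filter property names below are
// assumptions; verify them against the crate version you use.
use std::fs::File;
use std::sync::Arc;

use arrow::array::{ArrayRef, StringArray};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;
use parquet::file::properties::WriterProperties;
use parquet::schema::types::ColumnPath;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // A toy batch with a single md5_hash column (placeholder values).
    let schema = Arc::new(Schema::new(vec![Field::new("md5_hash", DataType::Utf8, false)]));
    let ids: ArrayRef = Arc::new(StringArray::from(vec![
        "9e107d9d372bb6826bd81d3542a419d6",
        "e4d909c290d0fb1ca068ffaddf22cbd0",
    ]));
    let batch = RecordBatch::try_new(schema.clone(), vec![ids])?;

    let md5 = ColumnPath::from("md5_hash");
    let props = WriterProperties::builder()
        // Enable a bloom filter only for the column we will probe later.
        .set_column_bloom_filter_enabled(md5.clone(), true)
        // Size/accuracy knobs: expected distinct values and false-positive rate.
        .set_column_bloom_filter_ndv(md5.clone(), 1_000_000)
        .set_column_bloom_filter_fpp(md5, 0.01)
        .build();

    let file = File::create("example_with_bloom.parquet")?;
    let mut writer = ArrowWriter::try_new(file, schema, Some(props))?;
    writer.write(&batch)?;
    writer.close()?;
    Ok(())
}
```

On the read side I believe the same crate also exposes the stored filters (an `Sbbf` type that can be probed with a value), but as noted above, wiring that into query-time skipping is currently left to the application.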

MonkeyChap avatar Jan 16 '25 18:01 MonkeyChap