Skip row_groups in parquet files using bloom_filters
Problem description
The parquet2 library has bloom filters implemented: https://github.com/jorgecarleitao/parquet2/tree/main/src/bloom_filter.
But as far as I can see, pruning row_groups by bloom filters isn't implemented in polars. This could provide a significant performance boost when filtering on binary or string columns.
@ritchie46 I can implement this if it makes sense. Don't know if it is already implemented in some way.
We already use min-max pruning based on statistics. That might be a starting point for your implementation. You would probably need to convert a polars expression into something the bloom filter implementation can accept.
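For context, a minimal sketch of what that min/max pruning looks like at the footer level, and where a bloom-filter check would slot in. This uses pyarrow purely to read the parquet metadata; the helper name and column are made up for illustration, and none of this is the actual Polars code path:

```python
import pyarrow.parquet as pq

def row_groups_to_read(path: str, column: str, value: str) -> list[int]:
    """Return indices of row groups that *may* contain `value` in `column`."""
    meta = pq.ParquetFile(path).metadata
    # Find the column chunk index for `column` (same order in every row group).
    first = meta.row_group(0)
    col_idx = next(
        i for i in range(first.num_columns)
        if first.column(i).path_in_schema == column
    )
    keep = []
    for rg in range(meta.num_row_groups):
        stats = meta.row_group(rg).column(col_idx).statistics
        if stats is None or not stats.has_min_max:
            keep.append(rg)  # no statistics -> cannot prune, must read
            continue
        if stats.min <= value <= stats.max:
            # min/max cannot rule the value out. For high-cardinality strings
            # (e.g. hashes) nearly every row group passes this test, which is
            # exactly where a bloom-filter membership check would prune much
            # harder before any data pages are read.
            keep.append(rg)
    return keep
```

The expression-conversion part would then boil down to extracting `col == literal` (and `is_in`) predicates from the polars expression and feeding the literals into checks like the one above.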
I just arrived here because I'm trying to read 10GB out of ~100TB of data with polars (or pyspark, open to either). I don't know which partitions contain the data, but I do know the 10GB sit mostly in the same ~0.1% of all partitions. All rows have an md5_hash column, which could be used together with bloom filters to read almost only the data we actually need. I.e.:
- we write the data with bloom_filters enabled, based on the MD5
- polars reads 2 datasets, the IDs df and the full large df in lazy mode
- through a join or some other way of telling polars which row IDs we want, read only <110% of the target data volume (i.e. skip the partitions that the bloom filters in the parquet files tell us don't contain any of the IDs we care about)
In my mind, what polars would have to do for me is grab all the parquet files' footers and check all the IDs against the bloom filters stored there, then skip every partition that we know doesn't contain any of the data.
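Roughly the workflow I have in mind, as a sketch (paths and the column name are placeholders; today this still ends up reading every partition, because nothing prunes on the bloom filters):

```python
import polars as pl

# The wanted rows are identified by their md5_hash values.
ids = pl.scan_parquet("ids/*.parquet").select("md5_hash")

# The full dataset, also carrying an md5_hash column.
big = pl.scan_parquet("data/**/*.parquet")

# A semi join keeps only the rows whose md5_hash appears in `ids`.
# With bloom-filter pruning, the reader could skip every row group (and
# hence most files) whose filter proves that none of the ids are present.
result = big.join(ids, on="md5_hash", how="semi").collect()
```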
Hey, not sure about your specific application, but I would recommend just building external indices for it and not using a query engine for parquet. Engines are very bad at skipping data, and loading the footers is very slow anyway if you have a significant number of files.
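For example, the external index can be as small as a per-file min/max of md5_hash (or a per-file bloom filter), built once. A sketch, with made-up paths and layout:

```python
import glob
import polars as pl

# One-off pass: record the per-file min/max of md5_hash in a tiny sidecar table.
# If the files are written sorted or bucketed by hash, these ranges are narrow
# and prune very well; a per-file bloom filter would be the sharper alternative.
files = glob.glob("data/**/*.parquet", recursive=True)
index = pl.concat([
    pl.scan_parquet(f).select(
        pl.lit(f).alias("file"),
        pl.col("md5_hash").min().alias("lo"),
        pl.col("md5_hash").max().alias("hi"),
    )
    for f in files
]).collect()
index.write_parquet("md5_index.parquet")
```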
mhhhm @ozgrakkurt good point, we're using kedro.org for our pipelining and that sometimes means giving away too much of the control over file reading. But I guess we could wrap a custom dataset that selects the files to read first and then reads only those. thx
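The wrapper itself could stay very small. A rough sketch that reuses the md5_index.parquet sidecar from the previous comment (all names are made up, and the range check is deliberately coarse):

```python
import polars as pl

def scan_matching_files(ids: pl.Series) -> pl.LazyFrame:
    """Select only files whose [lo, hi] hash range overlaps the wanted ids,
    then hand just that file list to polars."""
    index = pl.read_parquet("md5_index.parquet")
    wanted = index.filter(
        (pl.col("lo") <= ids.max()) & (pl.col("hi") >= ids.min())
    )
    # A per-id lookup or a per-file bloom filter would prune more aggressively;
    # this only uses the coarse overall min/max range of the ids.
    return pl.scan_parquet(wanted["file"].to_list())

lf = scan_matching_files(pl.Series("md5_hash", ["<hash 1>", "<hash 2>"]))
```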
Hi - did this ever get implemented? It seems like a really useful way to filter large (multi-file) datasets where there is a wide variety of string values. I can see that some of the Java libraries have done it. It would be a great feature to have for LazyFrame.
The Rust parquet libraries provide everything you need to write the bloom filters. Reading and filtering feels a bit more "roll your own" though - another reason why it would be great to bundle this into Polars.