[Feature Request] Support reading/writing bloom filters in the parquet file
**Is your feature request related to a problem? Please describe.** Today, we do not write bloom filter metadata for each column in parquet files, which makes reads inefficient.
**Describe the solution you'd like** The existing API for writing parquet files should be able to automatically persist bloom filter metadata in each parquet file. Our use-case on the read side:

1. We have a list of values `[a, b, c]` for a column `C`.
2. We will have `N` parquet files, each storing a bloom filter in the column metadata for column `C`.
3. We need the most efficient way to pick rows matching the values in 1.
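Conceptually, the pruning we're after can be sketched in pure Python. This is illustrative only: parquet's real bloom filters use a split-block format with xxHash, and the `BloomFilter` class, filenames, and per-file value lists below are all made up for the example.

```python
import hashlib


class BloomFilter:
    """Minimal illustrative bloom filter (NOT parquet's split-block variant)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, value):
        # Derive k bit positions from k salted hashes of the value.
        for i in range(self.num_hashes):
            h = hashlib.sha256(f"{i}:{value}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value):
        # False positives possible, false negatives impossible.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(value)
        )


# Hypothetical: one filter per parquet file, built from column C's values.
files = {
    "part-0.parquet": [2, 4, 5],
    "part-1.parquet": [100, 200],
}
filters = {}
for name, values in files.items():
    bf = BloomFilter()
    for v in values:
        bf.add(v)
    filters[name] = bf

# Prune: only read files whose filter might contain some query value.
query = [10, 100]
to_read = [n for n, bf in filters.items() if any(bf.might_contain(v) for v in query)]
```

Because bloom filters never produce false negatives, `part-1.parquet` (which contains 100) is always kept, while files containing none of the query values can usually be skipped without being read.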
**Describe alternatives you've considered** One alternative we have considered is storing the values in schema metadata, which PyArrow already supports. However, that metadata cannot be leveraged by any readers apart from us.
**Additional context** Below is sample Python code for the API we are expecting:
```python
import pyarrow as pa
import daft
from daft import col

table = pa.table(
    [
        pa.array([2, 4, 5, 100]),
        pa.array(["Flamingo", "Horse", "Brittle stars", "Centipede"]),
    ],
    names=["n_legs", "animals"],
)
df = daft.from_arrow(table)
file = df.write_parquet("./test.parquet", store_column_bloom_filters=["n_legs"])
new_df = daft.read_parquet(file.to_pydict()["path"][0])

legs_to_filter = [10, 100]
legs_not_in_file = [20, 30]

# Pick only rows where n_legs is in [10, 100].
# This operation must use bloom filters to prune irrelevant row groups.
# Returns only 1 record.
new_df.where(col("n_legs").is_in(legs_to_filter)).collect()

# This should not read the file at all, only the bloom filter metadata.
new_df.where(col("n_legs").is_in(legs_not_in_file)).collect()
```
Some resources indicate that the Rust Arrow implementation (the `parquet` crate) already supports reading and writing bloom filters in parquet.
Thanks @raghumdani !
We're chatting about this internally and will provide an update, along with an estimate of how easy this is to build on our end, this Friday.
Any updates on this?
Hi @ahmad-axds! We decided to deprioritize this because for many use-cases min/max statistics may actually already suffice.
Do you have a use-case for bloom filters that you think would be compelling?