
[Feature Request] Support reading/writing bloom filters in the parquet file

Open raghumdani opened this issue 1 year ago • 3 comments

Is your feature request related to a problem? Please describe. Today, we do not write bloom filter metadata for each column in parquet files, which makes reads inefficient.

Describe the solution you'd like The existing API for writing parquet files could automatically persist bloom filter metadata in each parquet file as well. Our use case on the read side:

  1. We have a list of column values [a, b, c] for a column C.
  2. We will have N parquet files, each with a bloom filter stored in the column metadata for column C.
  3. We need the most efficient way to pick rows matching the values in step 1.
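The read-side flow above can be sketched with a toy bloom filter in pure Python (illustrative only; none of these names are part of Daft's or Parquet's API). The key property is that a lookup answers either "definitely absent" (the row group can be pruned) or "possibly present" (the row group must be read):

```python
# Toy bloom filter illustrating row-group pruning. A real implementation
# would use the split-block bloom filter defined by the Parquet spec.
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, value):
        # Derive k bit positions from seeded hashes of the value.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value):
        # False means "definitely absent"; True means "possibly present"
        # (false positives are possible, false negatives are not).
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(value))

# One filter per row group for column "n_legs" (hypothetical data):
row_group_filter = BloomFilter()
for v in [2, 4, 5, 100]:
    row_group_filter.add(v)

# Pruning decision: read the row group only if some probed value might be present.
must_read = any(row_group_filter.might_contain(v) for v in [10, 100])
```

Because bloom filters never produce false negatives, every inserted value is guaranteed to report as possibly present, so `must_read` is `True` here.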

Describe alternatives you've considered An alternative solution we have considered is storing custom schema metadata, which PyArrow currently supports. However, that metadata cannot be leveraged by any readers other than us.

Additional context Below is sample Python code for the API we are expecting:

import pyarrow as pa
import daft
from daft import col

table = pa.table([
    pa.array([2, 4, 5, 100]),
    pa.array(["Flamingo", "Horse", "Brittle stars", "Centipede"])
], names=['n_legs', 'animals'])

df = daft.from_arrow(table)
file = df.write_parquet("./test.parquet", store_column_bloom_filters=["n_legs"])

new_df = daft.read_parquet(file.to_pydict()['path'][0])

legs_to_filter = [10, 100]
legs_not_in_file = [20, 30]

# pick only rows where n_legs in [10, 100]
# This operation must use bloom filters to prune out irrelevant row groups to read
# Returns only 1 record
new_df.where(col('n_legs').is_in(legs_to_filter)).collect()

# This should not result in a file read at all, but just the bloom filter metadata
new_df.where(col('n_legs').is_in(legs_not_in_file)).collect()

Some resources indicate that the Rust implementation of Arrow has support for manipulating bloom filters in parquet files.

raghumdani avatar Feb 03 '24 01:02 raghumdani

Thanks @raghumdani !

We're chatting about this internally and will provide an update, along with an estimate of how easy this is to build on our end, this Friday.

jaychia avatar Feb 07 '24 00:02 jaychia

Any updates on this?

ahmad-axds avatar Jun 07 '24 21:06 ahmad-axds

Hi @ahmad-axds! We decided to deprioritize this because, for many use cases, min/max statistics may already suffice.

Do you have a use-case for bloom filters that you think would be compelling?

jaychia avatar Jun 11 '24 23:06 jaychia