
Add bloom filter write support to ParquetWriter

Open jkylling opened this issue 1 year ago • 1 comments

Description

The bloom filters are added after all the row groups, right before the footer, similar to the first option described here.
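As a rough illustration of this layout (not the actual ParquetWriter code), the sketch below computes where each bloom filter would start if all of them are written back to back after the last row group. The class and method names are hypothetical, and it assumes the row groups directly follow the 4-byte `PAR1` magic at the start of the file.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Trino.
final class BloomFilterLayout
{
    // A Parquet file starts with the 4-byte magic "PAR1"
    static final int MAGIC_LENGTH = 4;

    private BloomFilterLayout() {}

    // Returns the file offset of each bloom filter, assuming all bloom filters
    // are written contiguously after the last row group and before the footer.
    static List<Long> bloomFilterOffsets(List<Long> rowGroupSizes, List<Long> bloomFilterSizes)
    {
        long position = MAGIC_LENGTH;
        for (long size : rowGroupSizes) {
            position += size;
        }
        List<Long> offsets = new ArrayList<>();
        for (long size : bloomFilterSizes) {
            offsets.add(position);
            position += size;
        }
        return offsets;
    }
}
```

Each offset would then be recorded in the corresponding column chunk metadata in the footer, so that readers can locate the filter.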

We do not support writing bloom filters for types for which we do not have read support.

Additional context and related issues

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
( ) Release notes are required, with the following suggested text:

# Section
* Fix some things. ({issue}`issuenumber`)

jkylling avatar Feb 12 '24 15:02 jkylling

Some work remains to decide how to configure the bloom filters in Hive (and also Iceberg and Delta). We could do the following.

For the Hive table properties:

parquet.bloom.filter.enabled#<column-name>=true
parquet.bloom.filter.fpp#<column-name>=0.1 # double between 0 and 1
parquet.bloom.filter.expected.ndv#<column-name>=1 # integer between 1 and Long.MAX_VALUE

These are the same properties used by parquet-mr: https://github.com/apache/parquet-mr/blob/20d43639b5a380335953742ad6c9b3dd98e09f29/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetOutputFormat.java#L152-L155

If any of the Hive table properties have invalid values, we treat them as unspecified. parquet.bloom.filter.enabled#<column-name>=true must always be specified for the other table properties to take effect.
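A minimal sketch of these rules, to make the intended behaviour concrete. The class and method names are hypothetical and this is not the implementation in this PR. It assumes the fpp bounds are exclusive (0 < fpp < 1), which the proposal above does not spell out.

```java
import java.util.Map;
import java.util.OptionalDouble;
import java.util.OptionalLong;

// Hypothetical helper for illustration only; not part of Trino or parquet-mr.
final class HiveBloomFilterProperties
{
    static final String ENABLED = "parquet.bloom.filter.enabled#";
    static final String FPP = "parquet.bloom.filter.fpp#";
    static final String NDV = "parquet.bloom.filter.expected.ndv#";

    private HiveBloomFilterProperties() {}

    // Only the literal value "true" (case-insensitive) enables the filter
    static boolean isEnabled(Map<String, String> properties, String column)
    {
        return "true".equalsIgnoreCase(properties.get(ENABLED + column));
    }

    // Invalid or out-of-range values are treated as unspecified.
    // Assumption: the bounds are exclusive, 0 < fpp < 1.
    static OptionalDouble fpp(Map<String, String> properties, String column)
    {
        String raw = properties.get(FPP + column);
        if (!isEnabled(properties, column) || raw == null) {
            return OptionalDouble.empty();
        }
        try {
            double value = Double.parseDouble(raw.trim());
            return (value > 0 && value < 1) ? OptionalDouble.of(value) : OptionalDouble.empty();
        }
        catch (NumberFormatException e) {
            return OptionalDouble.empty();
        }
    }

    // Accepts integers from 1 to Long.MAX_VALUE; anything else is unspecified
    static OptionalLong expectedNdv(Map<String, String> properties, String column)
    {
        String raw = properties.get(NDV + column);
        if (!isEnabled(properties, column) || raw == null) {
            return OptionalLong.empty();
        }
        try {
            long value = Long.parseLong(raw.trim());
            return value >= 1 ? OptionalLong.of(value) : OptionalLong.empty();
        }
        catch (NumberFormatException e) {
            return OptionalLong.empty();
        }
    }
}
```

With this shape, a column that sets only `parquet.bloom.filter.fpp#<column-name>` without the enabled flag yields no settings at all, which matches the rule that the enabled property must be present for the others to take effect.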

For the Iceberg table properties we will likely want to use write.parquet.bloom-filter-enabled.column.<column-name>, and similarly for the other properties; see https://iceberg.apache.org/docs/latest/configuration/#write-properties

For the Trino table properties we define:

parquet_bloom_filter_enabled, map(varchar, boolean) = MAP(ARRAY['<column-name>'], ARRAY[<enabled>])
parquet_bloom_filter_fpp, map(varchar, double) = MAP(ARRAY['<column-name>'], ARRAY[<fpp>])
parquet_bloom_filter_ndv, map(varchar, bigint) = MAP(ARRAY['<column-name>'], ARRAY[<ndv>])
parquet_bloom_filters_enabled, boolean. Enables bloom filters for all columns that support them. If parquet_bloom_filter_columns is also specified, the entries of parquet_bloom_filter_enabled take precedence.
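
One way to read the precedence rule is that an explicit per-column entry in parquet_bloom_filter_enabled overrides the table-wide parquet_bloom_filters_enabled flag. The sketch below encodes that reading. The names are hypothetical, and it additionally assumes that columns of unsupported types never get a bloom filter, even when explicitly enabled.

```java
import java.util.Map;
import java.util.Set;

// Hypothetical helper for illustration only; not part of Trino.
final class BloomFilterSelection
{
    private BloomFilterSelection() {}

    static boolean writeBloomFilter(
            String column,
            boolean enabledForAllColumns,       // parquet_bloom_filters_enabled
            Map<String, Boolean> perColumn,     // parquet_bloom_filter_enabled
            Set<String> supportedColumns)       // columns whose type supports bloom filters
    {
        // Assumption: unsupported types are skipped regardless of configuration
        if (!supportedColumns.contains(column)) {
            return false;
        }
        // An explicit per-column entry wins over the table-wide flag
        Boolean explicit = perColumn.get(column);
        return explicit != null ? explicit : enabledForAllColumns;
    }
}
```

Under this reading, a table can enable bloom filters everywhere and opt individual columns out by mapping them to false.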

jkylling avatar Feb 15 '24 22:02 jkylling

This pull request has gone a while without any activity. Tagging the Trino developer relations team: @bitsondatadev @colebow @mosabua

github-actions[bot] avatar Mar 08 '24 17:03 github-actions[bot]

This pull request has gone a while without any activity. Tagging the Trino developer relations team: @bitsondatadev @colebow @mosabua

github-actions[bot] avatar Apr 04 '24 17:04 github-actions[bot]

I've pushed some minor fixups; please squash them into the appropriate commits.

raunaqmorarka avatar Apr 11 '24 11:04 raunaqmorarka