trino Add bloom filter write support to ParquetWriter

Description

The bloom filters are added after all the row groups, right before the footer, similar to the first option described here.

We do not support writing bloom filters for types for which we do not have read support.

Additional context and related issues

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
( ) Release notes are required, with the following suggested text:

# Section
* Fix some things. ({issue}`issuenumber`)

Feb 12 '24 15:02 jkylling

Some work remains to decide how to configure the bloom filters in Hive (and also Iceberg and Delta). We could do the following.

For the Hive table properties:

parquet.bloom.filter.enabled#<column-name>=true
parquet.bloom.filter.fpp#<column-name>=0.1 # double between 0 and 1
parquet.bloom.filter.expected.ndv#<column-name>=1 # integer between 1 and Long.MAX_VALUE

These are the same properties used by parquet-mr: https://github.com/apache/parquet-mr/blob/20d43639b5a380335953742ad6c9b3dd98e09f29/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetOutputFormat.java#L152-L155

If any of the Hive table properties have invalid values, we treat them as unspecified. parquet.bloom.filter.enabled#<column-name>=true must always be specified for the other table properties to take effect.
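The fallback rules above could be sketched roughly as follows. This is only an illustration, not the code in this PR: the class `HiveBloomFilterProperties`, the holder `BloomFilterConfig`, and the method names are hypothetical, and the sketch assumes that the fpp bounds are exclusive and that only the literal value `true` enables a filter.

```java
import java.util.Map;
import java.util.Optional;
import java.util.OptionalDouble;
import java.util.OptionalLong;

// Hypothetical helper for illustration; not part of Trino or parquet-mr.
class HiveBloomFilterProperties {
    // Property prefixes as listed above; the column name follows the '#'.
    static final String ENABLED = "parquet.bloom.filter.enabled#";
    static final String FPP = "parquet.bloom.filter.fpp#";
    static final String NDV = "parquet.bloom.filter.expected.ndv#";

    // Empty optionals mean "unspecified": the writer would fall back to its defaults.
    static final class BloomFilterConfig {
        final OptionalDouble fpp;
        final OptionalLong ndv;

        BloomFilterConfig(OptionalDouble fpp, OptionalLong ndv) {
            this.fpp = fpp;
            this.ndv = ndv;
        }
    }

    // Returns a config only when the column is explicitly enabled.
    static Optional<BloomFilterConfig> forColumn(Map<String, String> properties, String column) {
        // Assumption: only the literal "true" (case-insensitive) enables the filter.
        if (!"true".equalsIgnoreCase(properties.get(ENABLED + column))) {
            return Optional.empty();
        }
        return Optional.of(new BloomFilterConfig(
                parseFpp(properties.get(FPP + column)),
                parseNdv(properties.get(NDV + column))));
    }

    // Invalid or out-of-range values are treated as unspecified.
    static OptionalDouble parseFpp(String value) {
        if (value == null) {
            return OptionalDouble.empty();
        }
        try {
            double fpp = Double.parseDouble(value.trim());
            // Assumption: the bounds 0 and 1 are exclusive.
            return fpp > 0 && fpp < 1 ? OptionalDouble.of(fpp) : OptionalDouble.empty();
        }
        catch (NumberFormatException e) {
            return OptionalDouble.empty();
        }
    }

    // Values above Long.MAX_VALUE fail to parse and are also treated as unspecified.
    static OptionalLong parseNdv(String value) {
        if (value == null) {
            return OptionalLong.empty();
        }
        try {
            long ndv = Long.parseLong(value.trim());
            return ndv >= 1 ? OptionalLong.of(ndv) : OptionalLong.empty();
        }
        catch (NumberFormatException e) {
            return OptionalLong.empty();
        }
    }
}
```

With this sketch, a column that has an fpp entry but no `enabled` entry yields no config at all, and an enabled column with an invalid ndv still gets a filter, with ndv left unspecified.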

For the Iceberg table properties, we will likely want to use write.parquet.bloom-filter-enabled.column.<column-name>, and similarly for the other properties: https://iceberg.apache.org/docs/latest/configuration/#write-properties

For the Trino table properties we define:

parquet_bloom_filter_enabled, map(varchar, boolean) = MAP(ARRAY['<column-name>'], ARRAY[<enabled>])
parquet_bloom_filter_fpp, map(varchar, double) = MAP(ARRAY['<column-name>'], ARRAY[<fpp>])
parquet_bloom_filter_ndv, map(varchar, bigint) = MAP(ARRAY['<column-name>'], ARRAY[<ndv>])
parquet_bloom_filters_enabled, boolean. It enables bloom filters for all columns that can support them. If parquet_bloom_filter_columns is also specified, the entries of parquet_bloom_filter_enabled take precedence.
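One possible reading of that precedence rule, sketched in Java: an explicit per-column entry in parquet_bloom_filter_enabled wins over the table-wide parquet_bloom_filters_enabled flag, and columns whose type cannot carry a bloom filter are always skipped. The class `BloomFilterSelection`, its method, and the `supportedColumns` parameter are hypothetical names used only for illustration.

```java
import java.util.Map;
import java.util.Set;

// Hypothetical helper for illustration; not the code in this PR.
class BloomFilterSelection {
    static boolean writeBloomFilter(
            String column,
            boolean allColumnsEnabled,               // parquet_bloom_filters_enabled
            Map<String, Boolean> perColumnEnabled,   // parquet_bloom_filter_enabled
            Set<String> supportedColumns)            // columns whose type supports bloom filters
    {
        // Unsupported types never get a bloom filter, whatever the properties say.
        if (!supportedColumns.contains(column)) {
            return false;
        }
        // Assumption: an explicit per-column entry overrides the table-wide flag.
        Boolean explicit = perColumnEnabled.get(column);
        return explicit != null ? explicit : allColumnsEnabled;
    }
}
```

Under this reading, setting the table-wide flag to true and mapping a single column to false writes bloom filters for every supported column except that one.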

Feb 15 '24 22:02 jkylling

This pull request has gone a while without any activity. Tagging the Trino developer relations team: @bitsondatadev @colebow @mosabua

Mar 08 '24 17:03 github-actions[bot]

This pull request has gone a while without any activity. Tagging the Trino developer relations team: @bitsondatadev @colebow @mosabua

Apr 04 '24 17:04 github-actions[bot]

I've pushed some minor fixups; please apply them to the appropriate commits.

Apr 11 '24 11:04 raunaqmorarka
