trino
Add bloom filter write support to ParquetWriter
Description
The bloom filters are added after all the row groups, right before the footer, similar to the first option described here.
We do not support writing bloom filters for types for which we do not have read support.
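To make the layout concrete, here is a minimal sketch of where the bloom filter bytes would land in the file. It is not the ParquetWriter code from this PR: the class and method names are hypothetical, and row groups, bloom filters and the footer are treated as opaque byte arrays. It relies only on the standard Parquet framing (leading `PAR1` magic, footer, 4-byte little-endian footer length, trailing `PAR1` magic).

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical illustration class; not part of Trino or parquet-mr.
final class BloomFilterLayoutSketch
{
    static final byte[] MAGIC = "PAR1".getBytes(StandardCharsets.US_ASCII);

    private BloomFilterLayoutSketch() {}

    // Writes: magic, all row groups, all bloom filters, footer, footer length, magic.
    // The start offset of each bloom filter is appended to bloomFilterOffsetsOut.
    // A real writer would record these offsets in the column chunk metadata and
    // only then serialize the footer; here the footer is an opaque placeholder.
    static byte[] write(
            List<byte[]> rowGroups,
            List<byte[]> bloomFilters,
            byte[] footer,
            List<Long> bloomFilterOffsetsOut)
    {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(MAGIC, 0, MAGIC.length);

        // All row groups come first
        for (byte[] rowGroup : rowGroups) {
            out.write(rowGroup, 0, rowGroup.length);
        }

        // Bloom filters follow the last row group, right before the footer
        for (byte[] bloomFilter : bloomFilters) {
            bloomFilterOffsetsOut.add((long) out.size());
            out.write(bloomFilter, 0, bloomFilter.length);
        }

        out.write(footer, 0, footer.length);

        // 4-byte little-endian footer length, then the trailing magic
        byte[] footerLength = ByteBuffer.allocate(4)
                .order(ByteOrder.LITTLE_ENDIAN)
                .putInt(footer.length)
                .array();
        out.write(footerLength, 0, footerLength.length);
        out.write(MAGIC, 0, MAGIC.length);
        return out.toByteArray();
    }
}
```

Keeping the bloom filters in one contiguous region after the row groups means no row group data has to be shifted or rewritten once the filters are built.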
Additional context and related issues
Release notes
( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
( ) Release notes are required, with the following suggested text:
# Section
* Fix some things. ({issue}`issuenumber`)
Some work remains to decide how to configure the bloom filters in Hive (and also Iceberg and Delta). We could do the following.
For the Hive table properties:
parquet.bloom.filter.enabled#<column-name>=true
parquet.bloom.filter.fpp#<column-name>=0.1 # double between 0 and 1
parquet.bloom.filter.expected.ndv#<column-name>=1 # integer between 1 and Long.MAX_VALUE
These are the same properties used by parquet-mr: https://github.com/apache/parquet-mr/blob/20d43639b5a380335953742ad6c9b3dd98e09f29/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetOutputFormat.java#L152-L155
If any of the Hive table properties has an invalid value, we treat it as unspecified. parquet.bloom.filter.enabled#<column-name>=true must always be specified for the other table properties to take effect.
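The rule for the Hive properties could be sketched as follows. This is only an illustration of the proposal, not code from this PR: `HiveBloomFilterProperties`, `ColumnSettings` and `forColumn` are hypothetical names, and treating the fpp range as exclusive (0 < fpp < 1) is an assumption.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical helper; the class and method names are not Trino or parquet-mr APIs.
final class HiveBloomFilterProperties
{
    static final String ENABLED = "parquet.bloom.filter.enabled#";
    static final String FPP = "parquet.bloom.filter.fpp#";
    static final String NDV = "parquet.bloom.filter.expected.ndv#";

    // Per-column settings; an empty Optional means "unspecified, use the default".
    static final class ColumnSettings
    {
        final Optional<Double> fpp;
        final Optional<Long> expectedNdv;

        ColumnSettings(Optional<Double> fpp, Optional<Long> expectedNdv)
        {
            this.fpp = fpp;
            this.expectedNdv = expectedNdv;
        }
    }

    private HiveBloomFilterProperties() {}

    // Returns settings only when enabled#<column>=true; otherwise fpp and ndv are ignored.
    static Optional<ColumnSettings> forColumn(Map<String, String> tableProperties, String column)
    {
        // Anything other than "true" (including a missing or invalid value) counts as unspecified
        if (!"true".equalsIgnoreCase(tableProperties.get(ENABLED + column))) {
            return Optional.empty();
        }
        return Optional.of(new ColumnSettings(
                parseFpp(tableProperties.get(FPP + column)),
                parseNdv(tableProperties.get(NDV + column))));
    }

    // Invalid or out-of-range values are treated as unspecified.
    // Assumption: the valid range excludes 0 and 1.
    static Optional<Double> parseFpp(String value)
    {
        if (value == null) {
            return Optional.empty();
        }
        try {
            double fpp = Double.parseDouble(value.trim());
            return (fpp > 0 && fpp < 1) ? Optional.of(fpp) : Optional.empty();
        }
        catch (NumberFormatException e) {
            return Optional.empty();
        }
    }

    // Accepts integers between 1 and Long.MAX_VALUE; anything else is unspecified.
    static Optional<Long> parseNdv(String value)
    {
        if (value == null) {
            return Optional.empty();
        }
        try {
            long ndv = Long.parseLong(value.trim());
            return ndv >= 1 ? Optional.of(ndv) : Optional.empty();
        }
        catch (NumberFormatException e) {
            return Optional.empty();
        }
    }
}
```

With this shape, a table that sets only `parquet.bloom.filter.fpp#<column-name>` gets no bloom filter for that column, and a malformed `expected.ndv` value silently falls back to whatever default the writer uses.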
For the Iceberg table properties we will likely want to use write.parquet.bloom-filter-enabled.column.<column-name>, and similarly for the other properties; see https://iceberg.apache.org/docs/latest/configuration/#write-properties
For the Trino table properties we define:
parquet_bloom_filter_enabled, map(varchar, boolean) = MAP(ARRAY['<column-name>'], ARRAY[<enabled>])
parquet_bloom_filter_fpp, map(varchar, double) = MAP(ARRAY['<column-name>'], ARRAY[<fpp>])
parquet_bloom_filter_ndv, map(varchar, bigint) = MAP(ARRAY['<column-name>'], ARRAY[<ndv>])
parquet_bloom_filters_enabled, boolean. It enables bloom filters for all columns that can support them. If parquet_bloom_filter_enabled is also specified, its entries take precedence.
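The precedence between the table-wide flag and the per-column map could look roughly like this. It is a sketch of the proposal only: `TrinoBloomFilterSelection` and `columnsWithBloomFilter` are hypothetical names, and silently skipping columns whose type has no bloom filter support is an assumption.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper; not an existing Trino class.
final class TrinoBloomFilterSelection
{
    private TrinoBloomFilterSelection() {}

    // supportedColumns: columns whose type can have a bloom filter written
    // allEnabled:       value of the table-wide boolean property
    // perColumnEnabled: entries of the per-column map property
    static Set<String> columnsWithBloomFilter(
            List<String> supportedColumns,
            boolean allEnabled,
            Map<String, Boolean> perColumnEnabled)
    {
        Set<String> result = new LinkedHashSet<>();
        for (String column : supportedColumns) {
            // A per-column entry, when present, overrides the table-wide flag
            Boolean override = perColumnEnabled.get(column);
            boolean enabled = (override != null) ? override : allEnabled;
            if (enabled) {
                result.add(column);
            }
        }
        // Assumption: entries for unsupported or unknown columns are ignored
        return result;
    }
}
```

For example, with the table-wide flag set to true and a per-column entry of false for one column, every supported column except that one gets a bloom filter.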
This pull request has gone a while without any activity. Tagging the Trino developer relations team: @bitsondatadev @colebow @mosabua
I've pushed some minor fixups; please squash them into the appropriate commits.