
Add Parquet bloom filter write support to Iceberg connector

Open • jkylling opened this issue Apr 18 '24 • 5 comments

Description

Additional context and related issues

Part of https://github.com/trinodb/trino/issues/21570

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text:

# Section
* Support writing Parquet Bloom filters in Iceberg connector. ({issue}`21570`)

jkylling avatar Apr 18 '24 07:04 jkylling

@leetcode-1533 FYI

findinpath avatar Apr 18 '24 09:04 findinpath

@jkylling gentle reminder about going forward with this contribution.

findinpath avatar May 03 '24 09:05 findinpath

Added a product test verifying that Trino and Spark can read the Iceberg tables written by each other when the Bloom filter table properties are set. I have not verified whether the files written by Spark actually contain Bloom filters.
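
The shape of the test is roughly the following sketch. The `parquet_bloom_filter_columns` property name is an assumption here (mirroring the Hive connector's equivalent property), and the table and column names are made up for illustration:

    -- Trino: create an Iceberg table with a Parquet Bloom filter on one column
    -- (property name assumed from the Hive connector; names hypothetical)
    CREATE TABLE iceberg.default.bloom_test (id BIGINT, request_id VARCHAR)
    WITH (
        format = 'PARQUET',
        parquet_bloom_filter_columns = ARRAY['request_id']
    );

    INSERT INTO iceberg.default.bloom_test VALUES (1, 'abc-123');

    -- Spark should then be able to read the same table back, and vice versa.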

Here's a little rant about the experience of writing product tests for this, in the hope that it might help improve things (there were more steps involved than the ones below):

  • I skim through the readme and figure out the command I probably need to run.
  • ./testing/bin/ptl test run --environment singlenode-spark-iceberg -- -t io.trino.tests.product.iceberg.TestIcebergSparkCompatibility.testSparkReadingTrinoBloomFilters
  • The command crashes as I don't have the most recent images and have to pull them, but the download times out for the larger images when done through the product test launcher. I fix it by pulling the images manually with `docker pull`.
  • ./testing/bin/ptl test run --environment singlenode-spark-iceberg -- -t io.trino.tests.product.iceberg.TestIcebergSparkCompatibility.testSparkReadingTrinoBloomFilters
  • The command crashes as I'm running on Linux with Podman instead of Docker, where disabling the OOM killer appears to be disallowed. So I apply my usual workaround of flipping the boolean in https://github.com/trinodb/trino/blob/bd7d3a3e797a49a23ce3f7dfe2af621a4ec6fe34/testing/trino-product-tests-launcher/src/main/java/io/trino/tests/product/launcher/env/Environment.java#L511
  • ./testing/bin/ptl test run --environment singlenode-spark-iceberg -- -t io.trino.tests.product.iceberg.TestIcebergSparkCompatibility.testSparkReadingTrinoBloomFilters
  • The command crashes because some container(s) are failing health checks at startup. I'm not sure which, so I edit the launcher to list which containers fail.
  • ./testing/bin/ptl test run --environment singlenode-spark-iceberg -- -t io.trino.tests.product.iceberg.TestIcebergSparkCompatibility.testSparkReadingTrinoBloomFilters
  • The command crashes because of a java.net.NoRouteToHostException: No route to host in the Spark container.
  • ./testing/bin/ptl test run --environment singlenode-spark-iceberg -- -t io.trino.tests.product.iceberg.TestIcebergSparkCompatibility.testSparkReadingTrinoBloomFilters
  • The command crashes because some containers are failing health checks at startup. It's the Presto and Hadoop containers. No mention of the Spark container.
  • ./testing/bin/ptl test run --environment singlenode-spark-iceberg -- -t io.trino.tests.product.iceberg.TestIcebergSparkCompatibility.testSparkReadingTrinoBloomFilters
  • The command crashes because some container(s) are failing health checks at startup. It's the Presto and Hadoop containers. I notice the log line `logs of container healthcheck: null`. Nothing else in the logs seems suspicious, so I decide to just disable all the product test code that checks container health checks.
  • ./testing/bin/ptl test run --environment singlenode-spark-iceberg -- -t io.trino.tests.product.iceberg.TestIcebergSparkCompatibility.testSparkReadingTrinoBloomFilters
  • Test starts to run! Fails on a syntax error in the test SQL. I fix the SQL syntax error.
  • ./testing/bin/ptl test run --environment singlenode-spark-iceberg -- -t io.trino.tests.product.iceberg.TestIcebergSparkCompatibility.testSparkReadingTrinoBloomFilters
  • The command crashes because of a java.net.NoRouteToHostException: No route to host in the Spark container.
  • ./testing/bin/ptl test run --environment singlenode-spark-iceberg -- -t io.trino.tests.product.iceberg.TestIcebergSparkCompatibility.testSparkReadingTrinoBloomFilters
  • Test starts to run! New syntax error in the test SQL. I fix the SQL syntax error.
  • ./testing/bin/ptl test run --environment singlenode-spark-iceberg -- -t io.trino.tests.product.iceberg.TestIcebergSparkCompatibility.testSparkReadingTrinoBloomFilters
  • The command crashes because of a java.net.NoRouteToHostException: No route to host in the Spark container.
  • ./testing/bin/ptl test run --environment singlenode-spark-iceberg -- -t io.trino.tests.product.iceberg.TestIcebergSparkCompatibility.testSparkReadingTrinoBloomFilters
  • Test succeeds!

jkylling avatar May 03 '24 10:05 jkylling

@raunaqmorarka , @jkylling sorry for delaying this work with my comments.

The scaffolding needed for getting query stats is currently not present in the product tests.

@jkylling please wrap up the work (I think the SHOW CREATE TABLE part is still open) and we'll do the query stats testing manually via the product test environment (and create a follow-up ticket to cover this topic with automated tests). (optional) I still feel a cheap option would be to create an Iceberg table with case-sensitive column names in Spark, add its contents as test resources, and check the query stats within trino-iceberg (similarly to what we have done already for time travel queries).
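
For the Spark side, Iceberg already exposes Bloom filter writing through per-column table properties, so a manual check could look roughly like this sketch (catalog and table names are hypothetical; the property key follows the Apache Iceberg table property format):

    -- Spark SQL: enable a Parquet Bloom filter for one column on an Iceberg table
    -- (catalog/table names hypothetical)
    CREATE TABLE iceberg_catalog.default.bloom_test (id BIGINT, request_id STRING)
    USING iceberg
    TBLPROPERTIES (
        'write.parquet.bloom-filter-enabled.column.request_id' = 'true'
    );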

findinpath avatar May 15 '24 09:05 findinpath

SHOW CREATE TABLE got fixed after your comments :) Please see TestIcebergParquetWithBloomFilters.testBloomFilterPropertiesArePersistedDuringCreate.
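
Roughly, the expectation is that the property survives a round trip and shows up in the table DDL; a sketch of the shape (not copied from the test; names hypothetical):

    -- Trino: the Bloom filter property should be visible in the DDL
    SHOW CREATE TABLE iceberg.default.bloom_test;
    -- CREATE TABLE iceberg.default.bloom_test (
    --    id bigint,
    --    request_id varchar
    -- )
    -- WITH (
    --    format = 'PARQUET',
    --    parquet_bloom_filter_columns = ARRAY['request_id'],
    --    ...
    -- )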

jkylling avatar May 15 '24 10:05 jkylling

@jkylling will this improve reads from tables with Bloom filters, or does it only deal with creating Bloom filters?

shohamyamin avatar Jun 03 '24 19:06 shohamyamin

@shohamyamin This only adds write support. Read support for Bloom filters was already added in Trino 406. Please see https://github.com/trinodb/trino/issues/9471 for the issue which tracked this.
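
Concretely, the read side means a selective equality predicate can skip row groups whose Bloom filter reports the value as definitely absent; a sketch (table and column names hypothetical):

    -- Trino: a point lookup that can benefit from a Parquet Bloom filter.
    -- Row groups whose filter rules out 'abc-123' are skipped without
    -- decoding the column data.
    SELECT count(*)
    FROM iceberg.default.bloom_test
    WHERE request_id = 'abc-123';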

jkylling avatar Jun 03 '24 21:06 jkylling