trino icon indicating copy to clipboard operation
trino copied to clipboard

Implement verification for optimized parquet writer

Open raunaqmorarka opened this issue 3 years ago • 0 comments

Description

Implements verification of file footer, row count and checksum of columns. Added a config parquet.optimized-writer.validation-percentage and session property in hive connector to control the percentage of written files that will be verified. By default, we will verify 5% of written files.

Is this change a fix, improvement, new feature, refactoring, or other?

new feature

Is this a change to the core query engine, a connector, client library, or the SPI interfaces? (be specific)

optimized parquet writer

How would you describe this change to a non-technical end user or system administrator?

Implements verification of files written by optimized parquet writer in hive connector.

Fixes https://github.com/trinodb/trino/issues/5356

Documentation

( ) No documentation is needed. (x) Sufficient documentation is included in this PR. ( ) Documentation PR is available with #prnumber. ( ) Documentation issue #issuenumber is filed, and can be handled later.

Release notes

( ) No release notes entries required. (x) Release notes entries required with the following suggested text:

# Hive
* Implement verification of files written by optimized parquet writer. ({issue}`13246`)

raunaqmorarka avatar Jul 20 '22 08:07 raunaqmorarka

  1. there seems to ba a lot of code copied between orc, rcfile and parquet write validation. It would be a lot cleaner to have it extracted to common place.
  2. This changes modifies ParquetReader and related classes significantly, which in theory could be avoided (may be hard in practice though).

Most of the repetition is around calculating column checksums. It's not a lot of code, there are minor differences with orc and sharing code between the 3 readers doesn't seem to be worth the effort. The refactorings around ParquetReader and related classes were needed to avoid making the file validation code depend on ConnectorPageSource (simplifies setup of validationInputFactory and reduces connector specific code) and based on reviewer feedback about making the code closer to ORC.

raunaqmorarka avatar Aug 12 '22 17:08 raunaqmorarka

Optimized parquet writer verification inserts benchmark.pdf Perf impact with 5% verification (current default) is around 2-3% Perf impact with 100% verification would be around 45%.

raunaqmorarka avatar Sep 12 '22 16:09 raunaqmorarka

Per https://github.com/trinodb/trino/issues/14047#issuecomment-1244866545 is this enabled in Hive connector only, and Iceberg/Delta (which also use the optimizer writer), do not run the verification?

findepi avatar Sep 13 '22 10:09 findepi

Per #14047 (comment) is this enabled in Hive connector only, and Iceberg/Delta (which also use the optimizer writer), do not run the verification?

Right, this PR implements parquet writer verification only for hive connector. It should be straightforward to add it to delta lake and iceberg as well, but I don't know if we actually want to run it there given that we've already been using the new writer without verification in those connectors.

raunaqmorarka avatar Sep 13 '22 10:09 raunaqmorarka