presto icon indicating copy to clipboard operation
presto copied to clipboard

Improve error handling for partial aggregation pushdown

Open rschlussel opened this issue 11 months ago • 1 comments

Improve error handling for partial aggregation pushdown and prevent returning wrong results when footer stats should not be relied on. This covers the following cases:

  1. Aggregations have been pushed down but partition file format does not support aggregation pushdown (can occur if table is declared with a supported storage format, but partition has a different storage format). Previously, page source providers for some file formats had special handling for this case, but not all
  2. Always throw an exception if aggregations have been pushed down but partition footer stats are unreliable. Previously, if filter pushdown was enabled (used OrcSelectivePageSourceFactory), we wouldn't create an AggregatedPageSource, so you would get an error somewhere on read. If it was disabled (OrcBatchPageSourceFactory), we would create an AggregatedPageSource and the query would silently give wrong results.
  3. Unexpected state where some but not all columns are of AGGREGATED type.

Error handling is still going to be reader dependent if both the table and partition format support partial aggregation pushdown, but the partition format does not support as many types (e.g. currently supports more types for partial aggregation pushdown).

Description

Previously AggregatedPageSources (which support the execution side of partial aggregation pushdown) were created from within the selective and batch page source factories of supported file formats. Similarly error handling for any unsupported file format needed to be repeated for each PageSourceFactory of all unsupported file formats. This resulted in a fragmented implementation and some unsupported file formats that did not include proper error handling.

Additionally, partial aggregation pushdown cannot be used when footer stats are unreliable, however handling for this was only added for one of the supported file formats factories (OrcSelectivePageSourceFactory) while others (orc and parquet batch factories) could silently return wrong results. Furthermore, the handling in OrcSelectivePageSourceFactory prevented wrong results by not creating an aggregated page source but didn't produce a clear error message because it kept going by trying to create a selective page source.

This PR makes HiveAggregatedPageSourceFactories into a top-level concept similar to HiveSelectivePageSourceFactories and HiveBatchPageSourceFactories so that we can unify all the error handling and prevent bugs from creeping in as new file format page source factories are added.
The main logic of the change is in HivePageSourceProvider. A lot of the rest of it is scaffolding to support that.

Motivation and Context

The motivation is twofold

  1. to ensure consistent error handling across different page sources even as new page formats or selective readers implementations are added.
  2. To prevent wrong results when footer stats are unreliable regardless of file format and any other configs.

This gap was discovered as part of an audit to make sure we were not assuming that partition file formats will always match table file formats.

Impact

Fix a potential wrong results bug when footer stats are marked as unreliable and aggregation pushdown is enabled.

Test Plan

new unit tests for HivePageSourceProvider

Contributor checklist

  • [X] Please make sure your submission complies with our development, formatting, commit message, and attribution guidelines.
  • [X] PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • [X] Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • [X] If release notes are required, they follow the release notes guidelines.
  • [X] Adequate tests were added if applicable.
  • [ ] CI passed.

Release Notes

Please follow release notes guidelines and fill in the release notes below.

== RELEASE NOTES ==
Hive Changes
* Fix a potential wrong results bug when footer stats are marked unreliable and partial aggregation pushdown is enabled.  Such queries will now fail with an error.

rschlussel avatar Feb 26 '24 21:02 rschlussel

I'm still updating the tests. Don't review yet

rschlussel avatar Feb 26 '24 21:02 rschlussel

this is ready for review (failing tests are flaky/unrelated)

rschlussel avatar Feb 27 '24 16:02 rschlussel

thanks for review @abhiseksaikia and @ClarenceThreepwood. I've addressed your comments. I also split out the commits a bit as per request from @ajaygeorge.

rschlussel avatar Mar 04 '24 18:03 rschlussel