presto icon indicating copy to clipboard operation
presto copied to clipboard

[Iceberg] Return predicate columns in table statistics

Open ZacBlanco opened this issue 7 months ago • 2 comments

Description

Returns TableStatistics instances which include VariableStatsEstimates corresponding to the selected columns in the table scan and the columns which are in the table layout predicates.

Motivation and Context

Queries which filter on a column but don't use it in their output might have the assignment pruned from PruneUnreferencedAssignments. When the assignment is pruned from the TableScanNode, it is no longer passed in the set of selected columns in the TableScanStatsRule. However, it's important to include information on columns not in the assignments (such as those applied only in a predicate) because those additional columns might be used to compute a more accurate output cardinality.

Specifically we saw this case in TPC-DS Q82 where the estimates for an Iceberg and Hive table scan which had the exact same statistics came back with very different estimates:

Iceberg
- TableScan[PlanNodeId 0][TableHandle {connectorId='iceberg', connectorHandle='item$data@Optional[8250555819003230185]', layout='Optional[item$data@Optional[8250555819003230185]]'}, grouped = false] => [i_item_sk:integer, i_item_id:varchar, i_item_desc:varchar, i_current_price:decimal(7,2)]
            Estimates: {source: CostBasedSourceInfo, rows: 80,872 (10.83MB), cpu: 11,350,978.15, memory: 0.00, network: 0.00}
Hive
- TableScan[PlanNodeId 0][TableHandle {connectorId='hive', connectorHandle='HiveTableHandle{schemaName=tpcds_sf1000_parquet_v2, tableName=item, analyzePartitionValues=Optional.empty}', layout='Optional[tpcds_sf1000_parquet_v2.item{domains={i_current_price=[ [["69.00", "99.00"]] ], i_manufact_id=[ [["105"], ["137"], ["180"], ["513"]] ], i_item_sk=[ [(<min>, <max>)] ]}}]'}, grouped = false] => [i_item_sk:integer, i_item_id:varchar(16), i_item_desc:varchar(200), i_current_price:decimal(7,2)]
            Estimates: {source: CostBasedSourceInfo, rows: 359 (49.14kB), cpu: 50,320.82, memory: 0.00, network: 0.00}

The approach in this PR mirrors the approach in the Hive connector

https://github.com/prestodb/presto/blob/dcb4ed50bc228b268b021e6a43473bcdeaf2fde6/presto-hive/src/main/java/com/facebook/presto/hive/HiveMetadata.java#L866-L878

Impact

Some additional information is included in the returned TableStatistics. Should help improve join orders.

Test Plan

  • New unit test to verify behavior of method which calculates the set of columns to include in the statistics calculations
  • Integration test to ensure that the table scan node generated by a query with a set of selected and non-selected columns with filter pushdown enabled and disabled on partitioned and non-partitioned tables should result in the same set of columns generated by the Iceberg metadata's getTableStatistics when executing the TableScanStatsRule

Contributor checklist

  • [x] Please make sure your submission complies with our development, formatting, commit message, and attribution guidelines.
  • [x] PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • [x] Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • [x] If release notes are required, they follow the release notes guidelines.
  • [x] Adequate tests were added if applicable.
  • [x] CI passed.

Release Notes

== NO RELEASE NOTE ==

ZacBlanco avatar Jun 27 '24 06:06 ZacBlanco