presto
presto copied to clipboard
[Iceberg] Return predicate columns in table statistics
Description
Returns TableStatistics
instances which include VariableStatsEstimate
s corresponding to the selected columns in the table scan and the columns which are in the table layout predicates.
Motivation and Context
Queries which filter on a column but don't use it in their output might have
the assignment pruned from PruneUnreferencedAssignments
. When the
assignment is pruned from the TableScanNode, it is no longer passed in
the set of selected columns in the TableScanStatsRule
. However, it's
important to include information on columns not in the assignments
(such as those applied only in a predicate) because those additional
columns might be used to compute a more accurate output cardinality.
Specifically we saw this case in TPC-DS Q82 where the estimates for an Iceberg and Hive table scan which had the exact same statistics came back with very different estimates:
Iceberg
- TableScan[PlanNodeId 0][TableHandle {connectorId='iceberg', connectorHandle='item$data@Optional[8250555819003230185]', layout='Optional[item$data@Optional[8250555819003230185]]'}, grouped = false] => [i_item_sk:integer, i_item_id:varchar, i_item_desc:varchar, i_current_price:decimal(7,2)]
Estimates: {source: CostBasedSourceInfo, rows: 80,872 (10.83MB), cpu: 11,350,978.15, memory: 0.00, network: 0.00}
Hive
- TableScan[PlanNodeId 0][TableHandle {connectorId='hive', connectorHandle='HiveTableHandle{schemaName=tpcds_sf1000_parquet_v2, tableName=item, analyzePartitionValues=Optional.empty}', layout='Optional[tpcds_sf1000_parquet_v2.item{domains={i_current_price=[ [["69.00", "99.00"]] ], i_manufact_id=[ [["105"], ["137"], ["180"], ["513"]] ], i_item_sk=[ [(<min>, <max>)] ]}}]'}, grouped = false] => [i_item_sk:integer, i_item_id:varchar(16), i_item_desc:varchar(200), i_current_price:decimal(7,2)]
Estimates: {source: CostBasedSourceInfo, rows: 359 (49.14kB), cpu: 50,320.82, memory: 0.00, network: 0.00}
The approach in this PR mirrors the approach in the Hive connector
https://github.com/prestodb/presto/blob/dcb4ed50bc228b268b021e6a43473bcdeaf2fde6/presto-hive/src/main/java/com/facebook/presto/hive/HiveMetadata.java#L866-L878
Impact
Some additional information is included in the returned TableStatistics. Should help improve join orders.
Test Plan
- New unit test to verify behavior of method which calculates the set of columns to include in the statistics calculations
- Integration test to ensure that the table scan node generated by a query with a set of selected and non-selected columns with filter pushdown enabled and disabled on partitioned and non-partitioned tables should result in the same set of columns generated by the Iceberg metadata's
getTableStatistics
when executing theTableScanStatsRule
Contributor checklist
- [x] Please make sure your submission complies with our development, formatting, commit message, and attribution guidelines.
- [x] PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
- [x] Documented new properties (with its default value), SQL syntax, functions, or other functionality.
- [x] If release notes are required, they follow the release notes guidelines.
- [x] Adequate tests were added if applicable.
- [x] CI passed.
Release Notes
== NO RELEASE NOTE ==