presto icon indicating copy to clipboard operation
presto copied to clipboard

EXPLAIN (type DISTRIBUTED) on CTE scans always shows stats as 0

Open ZacBlanco opened this issue 4 months ago • 0 comments

When executing queries with CTEs, cost estimates are lost after the physical optimizer replaces CTE producer and consumer nodes with the TableScan/Write/Commit operators. Any cost-based rules after the PhysicalCteOptimizer are introduced might not get any stats from the table scan nodes which read from the temp tables.

It can also make debugging more difficult when reading distributed-formatted plans.

Expected Behavior

  1. Stats for scans from CTEs should get stats properly.
  2. DISTRIBUTED-formatted plans should accurately show the stats for the CTE scans.

Current Behavior

EXPLAIN (format DISTRIBUTED) queries with CTEs enabled show output row counts and size as 0.

Example scan from TPC-DS q30

- ScanProject[PlanNodeId 2014,2468][table = TableHandle {connectorId='hive', connectorHandle='HiveTableHandle{schemaName=__temp_ctes__, tableName=__presto_temporary_table_PARQUET_20241009_233847_00073_s2xu9_f02e48b8_520e_47d4_ae97_ab1a2090>
             Estimates: {source: CostBasedSourceInfo, rows: 0 (0B), cpu: 0.00, memory: 0.00, network: 0.00}/{source: CostBasedSourceInfo, rows: 0 (0B), cpu: 0.00, memory: 0.00, network: 0.00}                                                     >
             $hashvalue_229 := combine_hash(BIGINT'0', COALESCE($operator$hash_code(ca_state_206), BIGINT'0')) (2:85)                                                                                                                               >
             LAYOUT: __temp_ctes__.__presto_temporary_table_parquet_20241009_233847_00073_s2xu9_f02e48b8_520e_47d4_ae97_ab1a209071e0{buckets=128}                                                                                                   >
             sum_207 := _c2_sum:decimal(38,2):2:REGULAR (2:108)                                                                                                                                                                                     >
             ca_state_206 := _c1_ca_state:char(2):1:REGULAR (2:327)                                                                                                                                                                                 >
             wr_returning_customer_sk_205 := _c0_wr_returning_customer_sk:bigint:0:REGULAR (2:300)

Possible Solution

Potentially store some kind of context alongside inside TableScan when generating it during the PhysicalCteOptimizer so that the stats can be retrieved in the TableScanStatsRule

Steps to Reproduce

Start the HiveQueryRunner and set the following session properties

CREATE SCHEMA hive.__temp_ctes___;
SET SESSION cte_materialization_strategy = 'ALL';
SET SESSION cte_partitioning_provider_catalog='hive';
SET SESSION hive.temporary_table_storage_format = 'PARQUET';
SET SESSION hive.temporary_table_schema = '__temp_ctes__';

Run an EXPLAIN query with CTEs:

EXPLAIN (type DISTRIBUTED) WITH t as (SELECT * FROM (VALUES 1, 2, 3)) SELECT * FROM t t1;

Table scans on CTE nodes should show estimates as 0.

 Fragment 1 [hive:buckets=128, bucketFunctionType=HIVE_COMPATIBLE, types=[string]]                                                                                                                                                                  >
     Output layout: [field_11]                                                                                                                                                                                                                      >
     Output partitioning: SINGLE []                                                                                                                                                                                                                 >
     Stage Execution Strategy: UNGROUPED_EXECUTION                                                                                                                                                                                                  >
     - TableScan[PlanNodeId 380][TableHandle {connectorId='hive', connectorHandle='HiveTableHandle{schemaName=__temp_ctes__, tableName=__presto_temporary_table_PARQUET_20241009_234614_00079_s2xu9_c2a33274_7a8a_413d_857b_dac539b8a995, analyzePar>
             Estimates: {source: CostBasedSourceInfo, rows: 0 (0B), cpu: 0.00, memory: 0.00, network: 0.00}                                                                                                                                         >
             LAYOUT: __temp_ctes__.__presto_temporary_table_parquet_20241009_234614_00079_s2xu9_c2a33274_7a8a_413d_857b_dac539b8a995{buckets=128}                                                                                                   >
             field_11 := _c0_field:int:0:REGULAR (1:54)

Screenshots (if appropriate)

Context

ZacBlanco avatar Oct 10 '24 00:10 ZacBlanco