gpdb icon indicating copy to clipboard operation
gpdb copied to clipboard

Inline CTEs in Orca that contain outer references

Open chrishajas opened this issue 1 year ago • 1 comments
trafficstars

Currently, Orca falls back to Planner if a CTE is referenced multiple times and contains outer references. This is due to the way Orca creates CTE producers/consumers for shared scans and adds a Materialize above the Sequence operator.

While Orca does support inlining CTEs, by default it only inlines CTEs if the CTE is referenced once or a guc is set to force inlining. With this change, Orca will also attempt to inline a CTE if it contains an outer reference. In the future, Orca may be able support generating a valid shared scan plan for CTEs with outer refs, but that is a much larger effort also involving executor changes.

Consider the following query:

select sum((with cte as (select * from jazz) select 1 from cte cte1, cte cte2 where cte1.h = foo.b)) as t FROM foo

The query tree coming into Orca is the following:

+--CLogicalProject
   |--CLogicalGet "foo" ("foo"), Columns: ["a" (0), "ctid" (1), "xmin" (2), "cmin" (3), "xmax" (4), "cmax" (5), "tableoid" (6), "gp_segment_id" (7)] Key sets: {[1,7]}
   +--CScalarProjectList
      +--CScalarProjectElement "t" (33)
         +--CScalarSubquery["?column?" (32)]
            +--CLogicalCTEAnchor (0)
               +--CLogicalProject
                  |--CLogicalNAryJoin
                  |  |--CLogicalCTEConsumer (0), Columns: ["a" (16)]
                  |  |--CLogicalCTEConsumer (0), Columns: ["a" (24)]
                  |  +--CScalarCmp (=)
                  |     |--CScalarIdent "a" (0)
                  |     +--CScalarConst (2)
                  +--CScalarProjectList
                     +--CScalarProjectElement "?column?" (32)
                        +--CScalarConst (1)

+--CLogicalCTEProducer (0), Columns: ["a" (8)]   rows:1   width:34  rebinds:1
   +--CLogicalGet "jazz" ("jazz"), Columns: ["a" (8), "ctid" (9), "xmin" (10), "cmin" (11), "xmax" (12), "cmax" (13), "tableoid" (14), "gp_segment_id" (15)] Key sets: {[1,7]}   rows:1   width:34  rebinds:1

Note that the outer reference is below the CTE Anchor. If this gets converted to a sequence with a shared scan, a material is required above it--which leads to the invalid plan.

We'll now produce a valid inlined plan instead:

 Gather Motion 3:1  (slice1; segments: 3)  (cost=0.00..1356691499.25 rows=1 width=4)
   ->  Seq Scan on foo  (cost=0.00..1356691499.25 rows=334 width=4)
         SubPlan 1
           ->  Result  (cost=0.00..1324032.04 rows=1 width=4)
                 ->  Result  (cost=0.00..1324032.04 rows=1 width=1)
                       One-Time Filter: (foo.a = 2)
                       ->  Materialize  (cost=0.00..1324032.04 rows=1 width=1)
                             ->  Broadcast Motion 3:3  (slice2; segments: 3)  (cost=0.00..1324032.04 rows=1 width=1)
                                   ->  Nested Loop  (cost=0.00..1324032.04 rows=1 width=1)
                                         Join Filter: true
                                         ->  Broadcast Motion 3:3  (slice3; segments: 3)  (cost=0.00..431.00 rows=1 width=1)
                                               ->  Seq Scan on jazz jazz_1  (cost=0.00..431.00 rows=1 width=1)
                                         ->  Seq Scan on jazz  (cost=0.00..431.00 rows=1 width=1)
 Optimizer: GPORCA
(14 rows)

Note: The code changes use a Setter to set the outer references rather than passing it through a constructor. This isn't ideal. I initially tried adding this in AddCTEProducer() which adds the CCTEInfoEntry. However, deriving the outer references of the child node requires that the CTEInfoEntry is already populated, thus I had to add this after the fact.

chrishajas avatar Apr 24 '24 22:04 chrishajas