Explain provides wrong rows number when flow is missing
When we ask GPDB to explain a query, it uses the flow to compute the row count: the flow tells whether a node is a single-segment command, a command executed on all segments, or, in the general case, one executed on a limited number of segments. But there are cases when the flow is not generated, and GPDB then falls back to the total segment count by default, which can be incorrect, for example, for nodes that are executed on a single segment only.
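To make the mechanism concrete, here is a rough sketch of the divisor selection described above (a sketch only; the type and member names such as Flow, FLOW_SINGLETON, and numsegments are illustrative assumptions, not the exact GPDB identifiers):

/*
 * Illustrative sketch of the current behavior: EXPLAIN picks the divisor
 * for the row estimate from the node's flow, and when the flow is missing
 * it falls back to the full segment count of the cluster.  All names here
 * are assumptions, not the exact GPDB identifiers.
 */
static int
rows_divisor_from_flow(Flow *flow, int cluster_segments)
{
    if (flow == NULL)
        return cluster_segments;       /* fallback: wrong for single-segment nodes */

    switch (flow->flotype)
    {
        case FLOW_SINGLETON:
            return 1;                  /* the node runs in a single process */
        case FLOW_PARTITIONED:
            return flow->numsegments;  /* rows are spread across a gang of segments */
        default:
            return cluster_segments;
    }
}

The following script reproduces the problem: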
DROP TABLE IF EXISTS lab1;
CREATE TABLE lab1 (id int, descr text) WITH (appendonly=true) DISTRIBUTED BY (id);
INSERT INTO lab1 SELECT generate_series(1,5000) AS id, md5(random()::text) AS descr;
explain analyze select t1.id, row_number() over (order by t1.id desc) from lab1 t1 join lab1 t2 using(descr);
If we explain this query, we will get an output like this:
Order By: lab1.id
-> Gather Motion 2:1 (slice2; segments: 2) (cost=0.00..8179.15 rows=25000000 width=4) (actual time=7877.092..12694.123 rows=25000000 loops=1)
Merge Key: lab1.id
-> Sort (cost=0.00..7730.15 rows=12500000 width=4) (actual time=7874.529..9583.758 rows=12730000 loops=1)
Sort Key: lab1.id
Sort Method: external merge Disk: 195520kB
-> Hash Join (cost=0.00..1046.52 rows=12500000 width=4) (actual time=6.390..1543.301 rows=12730000 loops=1)
Hash Cond: (lab1.descr = lab1_1.descr)
Extra Text: (seg1) Hash chain length 5000.0 avg, 5000 max, using 1 of 131072 buckets.
-> Seq Scan on lab1 (cost=0.00..431.07 rows=2500 width=37) (actual time=0.180..2.948 rows=2546 loops=1)
-> Hash (cost=435.54..435.54 rows=5000 width=33) (actual time=4.338..4.338 rows=5000 loops=1)
-> Broadcast Motion 2:2 (slice1; segments: 2) (cost=0.00..435.54 rows=5000 width=33) (actual time=0.084..2.978 rows=5000 loops=1)
-> Seq Scan on lab1 lab1_1 (cost=0.00..431.07 rows=2500 width=33) (actual time=0.167..0.803 rows=2546 loops=1)
Planning time: 24.088 ms
(slice0) Executor memory: 232K bytes.
(slice1) Executor memory: 204K bytes avg x 2 workers, 204K bytes max (seg0).
* (slice2) Executor memory: 67558K bytes avg x 2 workers, 67558K bytes max (seg0). Work_mem: 65758K bytes max, 612237K bytes wanted.
Memory used: 128000kB
Memory wanted: 1837210kB
Optimizer: Pivotal Optimizer (GPORCA)
Execution time: 20194.755 ms
(22 rows)
As we can see, WindowAgg reports 12500000 rows while its nested plan produces 25000000 rows: the WindowAgg row count is smaller than expected by exactly a factor of the number of segments, which is caused by the missing flow.
This patch doesn't rely on the flow; instead it uses the gangSize of the current slice, which is calculated correctly and is always present. gangSize is a simple and reliable way to get the number of segments that processed a given part of the plan.
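A minimal sketch of that idea (assuming the slice exposes its gang size as a gangSize field, as described above; the other names are illustrative, not the actual patch):

/*
 * Illustrative sketch of the patch's idea: take the divisor from the gang
 * size of the slice that executes the node.  Unlike the flow, the gang
 * size is always computed, so no fallback to the cluster-wide segment
 * count is needed.  Names other than gangSize are assumptions.
 */
static int
rows_divisor_from_slice(Slice *current_slice)
{
    int gang_size = (current_slice != NULL) ? current_slice->gangSize : 1;

    return (gang_size > 0) ? gang_size : 1;   /* a single-process slice divides by 1 */
}

For the query above, the WindowAgg runs in slice0 above the Gather Motion, i.e. in a single process, so with this approach its estimate would no longer be divided by the cluster's segment count.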
However, ORCA has another bug related to how it calculates the row count. The specialty of ORCA is that it can return a fractional number of rows, which EXPLAIN later ceils. This can lead to a situation where ORCA returns a wrong row count, EXPLAIN ceils it, and we get an incorrect answer. For example, the following query
create table tcorr1(a int, b int);
create table tcorr2(a int, b int);
insert into tcorr1 values (1,99);
insert into tcorr2 values (1,1);
explain
select *
from tcorr1 out
where out.b in (select coalesce(tcorr2.a, 99)
from tcorr1 left outer join tcorr2 on tcorr1.a=tcorr2.a+out.a);
produces a plan like this:
QUERY PLAN
--------------------------------------------------------------------------------------------------------
Result (cost=0.00..1356692026.74 rows=1 width=8)
Filter: (SubPlan 1)
-> Gather Motion 3:1 (slice1; segments: 3) (cost=0.00..431.00 rows=1 width=8)
-> Seq Scan on tcorr1 (cost=0.00..431.00 rows=1 width=8)
SubPlan 1 (slice0)
-> Result (cost=0.00..1324032.56 rows=2 width=4)
-> Nested Loop Left Join (cost=0.00..1324032.56 rows=2 width=4)
Join Filter: (tcorr1_1.a = (tcorr2.a + tcorr1.a))
-> Materialize (cost=0.00..431.00 rows=1 width=4)
-> Gather Motion 3:1 (slice3; segments: 3) (cost=0.00..431.00 rows=1 width=4)
-> Seq Scan on tcorr1 tcorr1_1 (cost=0.00..431.00 rows=1 width=4)
-> Materialize (cost=0.00..431.00 rows=1 width=4)
-> Gather Motion 3:1 (slice2; segments: 3) (cost=0.00..431.00 rows=1 width=4)
-> Seq Scan on tcorr2 (cost=0.00..431.00 rows=1 width=4)
Optimizer: Pivotal Optimizer (GPORCA)
(15 rows)
As we can see, the Nested Loop Left Join shows two rows when it can produce only one. This happens because ORCA returns 1.2 rows for this node and EXPLAIN ceils that number to 2.
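A tiny illustration of the rounding effect, using the number from the example above (plain C arithmetic, not GPDB code):

#include <math.h>
#include <stdio.h>

int
main(void)
{
    double orca_estimate = 1.2;                 /* fractional estimate returned by ORCA */
    long   shown = (long) ceil(orca_estimate);  /* EXPLAIN ceils it and prints 2 */

    printf("estimate=%.1f shown=%ld actual=1\n", orca_estimate, shown);
    return 0;
}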
Thanks a lot for your contribution.
PR pipeline fails at
planner:
=====================================================================
DIFF FILE: ../gpdb_src/src/test/regress/regression.diffs
----------------------------------------------------------------------
--- /tmp/build/d62a0504/gpdb_src/src/test/regress/expected/qp_orca_fallback.out 2022-07-14 22:20:52.564755024 +0000
+++ /tmp/build/d62a0504/gpdb_src/src/test/regress/results/qp_orca_fallback.out 2022-07-14 22:20:52.576756178 +0000
@@ -476,58 +480,3 @@
}
GP_IGNORE:(5 rows)
--- ORCA should fallback in case there are any CTE Consumers beneath a duplicate-hazard motion
--- Should be removed after issue#13039 is fixed
-GP_IGNORE:-- start_ignore
-GP_IGNORE:DROP TABLE IF EXISTS tbl1, tbl2;
-GP_IGNORE:CREATE TABLE tbl2 (
-GP_IGNORE: id numeric NULL,
-GP_IGNORE: refrcode varchar(255) NULL,
-GP_IGNORE: referenceid numeric NULL
-GP_IGNORE:)
-GP_IGNORE:DISTRIBUTED REPLICATED;
-GP_IGNORE:
-GP_IGNORE:CREATE TABLE tbl1 (
-GP_IGNORE: id bigserial NOT NULL,
-GP_IGNORE: iscalctrg varchar(15) NOT NULL,
-GP_IGNORE: iscalcdetail varchar(15) NULL
-GP_IGNORE:)
-GP_IGNORE:DISTRIBUTED REPLICATED;
-GP_IGNORE:-- end_ignore
-EXPLAIN WITH
-t1 AS (SELECT * FROM tbl1),
-t2 AS (SELECT id, refrcode FROM tbl2 WHERE REFERENCEID = 101991)
- SELECT p.*FROM t1 p
- JOIN t2 r
- ON p.isCalcTRG = r.RefrCode
- JOIN t2 r1
- ON p.isCalcDetail = r1.RefrCode
- LIMIT 1;
- QUERY PLAN
----------------------------------------------------------------------------------------------
- Filter: (referenceid = '101991'::numeric)
- -> Seq Scan on tbl2 (cost=0.00..166.25 rows=6 width=548)
- Filter: (referenceid = '101991'::numeric)
- -> Seq Scan on tbl2 tbl2_1 (cost=0.00..166.25 rows=6 width=548)
- -> Subquery Scan on r (cost=0.00..166.30 rows=6 width=548)
- -> Hash (cost=166.30..166.30 rows=6 width=548)
- -> Seq Scan on tbl1 (cost=0.00..344.00 rows=24400 width=104)
- -> Subquery Scan on r1 (cost=0.00..166.30 rows=6 width=548)
- Hash Cond: ((tbl1.iscalctrg)::text = (r.refrcode)::text)
- -> Hash (cost=166.30..166.30 rows=6 width=548)
- -> Hash Join (cost=0.17..580.97 rows=130 width=104)
- Hash Cond: ((tbl1.iscalcdetail)::text = (r1.refrcode)::text)
- -> Hash Join (cost=0.34..581.69 rows=8 width=104)
- -> Gather Motion 1:1 (slice1; segments: 1) (cost=0.34..581.83 rows=8 width=104)
- Limit (cost=0.34..83.24 rows=1 width=104)
-(16 rows)
-
--- Orca should fallback for functions returning composite types
-create type compType as (a int, b int);
-create function myfunc5() returns compType IMMUTABLE as $$ select 2,3 $$ language sql;
-select a from myfunc5();
- a
----
- 2
-(1 row)
-
======================================================================
ORCA:
========================
2 of 547 tests failed.
========================
The differences that caused some tests to fail can be viewed in the
file "/tmp/build/b235ddbb/gpdb_src/src/test/regress/regression.diffs". A copy of the test summary that you see
above is saved in the file "/tmp/build/b235ddbb/gpdb_src/src/test/regress/regression.out".
make[2]: *** [installcheck-good] Error 1
make[1]: *** [installcheck-regress-recurse] Error 2
make: *** [installcheck-world-src/test-recurse] Error 2
======================================================================
DIFF FILE: ../gpdb_src/src/test/regress/regression.diffs
----------------------------------------------------------------------
--- /tmp/build/b235ddbb/gpdb_src/src/test/regress/expected/gporca_optimizer.out 2022-07-14 22:23:08.191662056 +0000
+++ /tmp/build/b235ddbb/gpdb_src/src/test/regress/results/gporca.out 2022-07-14 22:23:08.575688795 +0000
@@ -15184,30 +15218,30 @@
Index Cond: (material_bitmapscan.k = 'Thu Jun 03 00:00:00 2021'::timestamp without time zone)
Output: share0_ref2.i, share0_ref2.j
-> Bitmap Index Scan on material_bitmapscan_idx (cost=0.00..0.00 rows=0 width=0)
- -> Shared Scan (share slice:id 1:0) (cost=0.00..431.00 rows=1 width=4)
+ -> Shared Scan (share slice:id 1:0) (cost=0.00..431.00 rows=3 width=4)
Filter: ((material_bitmapscan.i = 2) AND (material_bitmapscan.j = 2) AND (material_bitmapscan.l = 'Thu Jun 03 00:00:00 2021'::timestamp without time zone))
Filter: (share0_ref2.j = 2)
Output: material_bitmapscan.i, material_bitmapscan.j
Output: share0_ref2.j
Output: share0_ref3.i, share0_ref3.j
Recheck Cond: (material_bitmapscan.k = 'Thu Jun 03 00:00:00 2021'::timestamp without time zone)
- -> Bitmap Heap Scan on orca.material_bitmapscan (cost=0.00..390.25 rows=1 width=8)
- -> Result (cost=0.00..431.00 rows=1 width=4)
- -> Shared Scan (share slice:id 1:0) (cost=0.00..431.00 rows=1 width=8)
+ -> Bitmap Heap Scan on orca.material_bitmapscan (cost=0.00..390.25 rows=3 width=8)
+ -> Result (cost=0.00..431.00 rows=3 width=4)
+ -> Shared Scan (share slice:id 1:0) (cost=0.00..431.00 rows=3 width=8)
Filter: (share0_ref3.j = 2)
Output: material_bitmapscan.i, material_bitmapscan.j
Output: share0_ref2.j
Output: share0_ref3.i, share0_ref3.j
- -> Hash (cost=431.00..431.00 rows=1 width=4)
- -> Materialize (cost=0.00..390.25 rows=1 width=1)
- -> Result (cost=0.00..431.00 rows=1 width=8)
+ -> Hash (cost=431.00..431.00 rows=3 width=4)
+ -> Materialize (cost=0.00..390.25 rows=3 width=1)
+ -> Result (cost=0.00..431.00 rows=3 width=8)
Hash Cond: (share0_ref3.j = share0_ref2.j)
Output: share0_ref1.i, share0_ref1.j
Output: share0_ref3.i
- -> Hash Join (cost=0.00..862.00 rows=1 width=4)
- -> Shared Scan (share slice:id 1:0) (cost=0.00..390.25 rows=1 width=1)
+ -> Hash Join (cost=0.00..862.00 rows=3 width=4)
+ -> Shared Scan (share slice:id 1:0) (cost=0.00..390.25 rows=3 width=1)
Output: share0_ref3.i
- -> Sequence (cost=0.00..1252.25 rows=1 width=4)
+ -> Sequence (cost=0.00..1252.25 rows=3 width=4)
Output: share0_ref3.i
Gather Motion 1:1 (slice1; segments: 1) (cost=0.00..1252.25 rows=1 width=4)
(31 rows)
@@ -16294,7 +16328,7 @@
-> Hash (cost=431.00..431.00 rows=1 width=8)
Hash Cond: (window_agg_test.j = window_agg_test_1.j)
-> Hash Join (cost=0.00..862.00 rows=1 width=22)
- -> Split (cost=0.00..862.00 rows=1 width=22)
+ -> Split (cost=0.00..862.00 rows=2 width=22)
-> Explicit Redistribute Motion 1:3 (slice3; segments: 1) (cost=0.00..862.00 rows=2 width=22)
-> Result (cost=0.00..862.00 rows=1 width=26)
Update (cost=0.00..862.07 rows=1 width=1)
======================================================================
--- /tmp/build/b235ddbb/gpdb_src/src/test/regress/expected/qp_orca_fallback_optimizer.out 2022-07-14 22:33:56.352791623 +0000
+++ /tmp/build/b235ddbb/gpdb_src/src/test/regress/results/qp_orca_fallback.out 2022-07-14 22:33:56.364792458 +0000
@@ -594,62 +598,3 @@
}
GP_IGNORE:(5 rows)
--- ORCA should fallback in case there are any CTE Consumers beneath a duplicate-hazard motion
--- Should be removed after issue#13039 is fixed
-GP_IGNORE:-- start_ignore
-GP_IGNORE:DROP TABLE IF EXISTS tbl1, tbl2;
-GP_IGNORE:CREATE TABLE tbl2 (
-GP_IGNORE: id numeric NULL,
-GP_IGNORE: refrcode varchar(255) NULL,
-GP_IGNORE: referenceid numeric NULL
-GP_IGNORE:)
-GP_IGNORE:DISTRIBUTED REPLICATED;
-GP_IGNORE:
-GP_IGNORE:CREATE TABLE tbl1 (
-GP_IGNORE: id bigserial NOT NULL,
-GP_IGNORE: iscalctrg varchar(15) NOT NULL,
-GP_IGNORE: iscalcdetail varchar(15) NULL
-GP_IGNORE:)
-GP_IGNORE:DISTRIBUTED REPLICATED;
-GP_IGNORE:-- end_ignore
-EXPLAIN WITH
-t1 AS (SELECT * FROM tbl1),
-t2 AS (SELECT id, refrcode FROM tbl2 WHERE REFERENCEID = 101991)
- SELECT p.*FROM t1 p
- JOIN t2 r
- ON p.isCalcTRG = r.RefrCode
- JOIN t2 r1
- ON p.isCalcDetail = r1.RefrCode
- LIMIT 1;
-INFO: GPORCA failed to produce a plan, falling back to planner
-DETAIL: Feature not supported: CTE Consumer without the appropriate CTE Producer under a duplicate-hazard motion or a tainted replicated node
- QUERY PLAN
----------------------------------------------------------------------------------------------
- Filter: (referenceid = '101991'::numeric)
- -> Seq Scan on tbl2 (cost=0.00..166.25 rows=6 width=548)
- Filter: (referenceid = '101991'::numeric)
- -> Seq Scan on tbl2 tbl2_1 (cost=0.00..166.25 rows=6 width=548)
- -> Subquery Scan on r (cost=0.00..166.30 rows=6 width=548)
- -> Hash (cost=166.30..166.30 rows=6 width=548)
- -> Seq Scan on tbl1 (cost=0.00..344.00 rows=24400 width=104)
- -> Subquery Scan on r1 (cost=0.00..166.30 rows=6 width=548)
- Hash Cond: ((tbl1.iscalctrg)::text = (r.refrcode)::text)
- -> Hash (cost=166.30..166.30 rows=6 width=548)
- -> Hash Join (cost=0.17..580.97 rows=130 width=104)
- Hash Cond: ((tbl1.iscalcdetail)::text = (r1.refrcode)::text)
- -> Hash Join (cost=0.34..581.69 rows=8 width=104)
- -> Gather Motion 1:1 (slice1; segments: 1) (cost=0.34..581.83 rows=8 width=104)
- Limit (cost=0.34..83.24 rows=1 width=104)
-(16 rows)
-
--- Orca should fallback for functions returning composite types
-create type compType as (a int, b int);
-create function myfunc5() returns compType IMMUTABLE as $$ select 2,3 $$ language sql;
-select a from myfunc5();
-INFO: GPORCA failed to produce a plan, falling back to planner
-DETAIL: Feature not supported: Row-type variable
- a
----
- 2
-(1 row)
-
======================================================================
@kainwen, hi, all the tests have passed, could you take a look?
And IMO, the key point is that we should reach a clear consensus on the exact meaning of rows in EXPLAIN results: should it be all tuples, or all tuples divided by the gang size? And is it reasonable for different node types to report different things?
I found that the plan rows in EXPLAIN results are not consistent on the master branch:
postgres=# explain update scale_factor_repl a set c1 = b.c2 from scale_factor_part_distr b returning *;
QUERY PLAN
------------------------------------------------------------------------------------------------------------
Explicit Gather Motion 3:1 (slice1; segments: 3) (cost=10000000003.58..10000000003.58 rows=100 width=28)
-> Update on scale_factor_repl a (cost=10000000000.00..10000000003.58 rows=100 width=28)
-> Nested Loop (cost=10000000000.00..10000000003.58 rows=100 width=28)
-> Broadcast Motion 2:3 (slice2; segments: 2) (cost=0.00..1.20 rows=10 width=14)
-> Seq Scan on scale_factor_part_distr b (cost=0.00..1.05 rows=5 width=14)
-> Materialize (cost=0.00..1.15 rows=10 width=14)
-> Seq Scan on scale_factor_repl a (cost=0.00..1.10 rows=10 width=14)
Optimizer: Postgres query optimizer
(8 rows)
As we can see, the tables scale_factor_part_distr and scale_factor_repl both have 10 tuples, but the plan rows for the Seq Scan on scale_factor_part_distr b show just 5, not 10.
Just ignore this comment; I missed the fact that the table scale_factor_repl is REPLICATED.
Hi, you're absolutely correct, that's not the expected behavior. Moreover, we've faced another problem that makes this patch incorrect. Currently, GPDB always divides the row estimate by the number of segments in the cluster, except for Motions, because ORCA fills in the flow only for nodes under Motions and never fills in the locus type. So there is a counterpart in the ORCA code base that multiplies the row estimate back for replicated distributions. But that block of code also handles Broadcast Motions, and removing the kludge breaks about 47 ORCA mdp tests. Ideally, we need to move all the scaling logic to the EXPLAIN side.
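As a thought experiment, moving all the scaling to the EXPLAIN side might look roughly like this (everything below, including the distribution enum, is a hypothetical sketch rather than existing GPDB or ORCA code):

/*
 * Hypothetical sketch: the optimizer keeps the logical (unscaled) row
 * estimate, and EXPLAIN alone decides whether to divide it, based on how
 * the node's output is distributed.  All names are illustrative and do
 * not exist in GPDB.
 */
typedef enum SketchDistribution
{
    SKETCH_SINGLETON,    /* one process produces all the rows */
    SKETCH_REPLICATED,   /* every segment produces the full row set */
    SKETCH_PARTITIONED   /* rows are spread across the gang */
} SketchDistribution;

static double
sketch_rows_to_print(double logical_rows, SketchDistribution dist, int gang_size)
{
    if (dist == SKETCH_PARTITIONED && gang_size > 1)
        return logical_rows / gang_size;   /* per-segment share */

    return logical_rows;                   /* singleton / replicated: no division */
}

With something along these lines, the multiply-back kludge for replicated distributions in ORCA (and its side effect on Broadcast Motions) would not be needed, because the optimizer would never have to pre-scale its estimates for EXPLAIN's benefit.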
Based on all the review comments, it is clear the change is not in an acceptable state. Hence, closing this PR. Great effort and reviews.