gpdb Extra motion node when insert into randomly distributed table from hash distributed table on 6X

trafficstars

This is a behavior mismatch between 6X and master. Whether the extra motion node is desirable is debatable but it should be same for 6X and master.

Greenplum version or build

PostgreSQL 12beta2 (Greenplum Database 7.0.0-alpha.0+4afa059 build dev) PostgreSQL 9.4.24 (Greenplum Database 6.11.2+dev.31.gfbc0a7be4c build dev)

OS version and uname -a

Ubuntu 20.04 Linux dev 5.4.0-51-generic #56-Ubuntu SMP Mon Oct 5 14:28:49 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

autoconf options used ( config.status --config )

'--disable-orca' '--disable-pxf' '--disable-gpcloud' '--enable-debug'

Step to reproduce the behavior

CREATE TABLE t1(a int) DISTRIBUTED RANDOMLY; CREATE TABLE t2(a int) DISTRIBUTED BY(a); EXPLAIN INSERT INTO t1 SELECT a FROM t2;

Expected behavior

Plans generated by master and 6X should be the same.

Actual behavior

6X plan has an extra motion node. 6X:

                                          QUERY PLAN                                           
-----------------------------------------------------------------------------------------------
 Insert on t1  (cost=0.00..1063.00 rows=32100 width=4)
   ->  Redistribute Motion 3:3  (slice1; segments: 3)  (cost=0.00..1063.00 rows=32100 width=4)
         ->  Seq Scan on t2  (cost=0.00..1063.00 rows=32100 width=4)
 Optimizer: Postgres query optimizer
(4 rows)

master:

                          QUERY PLAN                          
--------------------------------------------------------------
 Insert on t1  (cost=0.00..355.00 rows=32100 width=4)
   ->  Seq Scan on t2  (cost=0.00..355.00 rows=32100 width=4)
 Optimizer: Postgres query optimizer
(3 rows)

Oct 20 '20 09:10 pf-qiu

I think we need to figure out one question before going deep into this plan difference.

Should random distribution guarantee a balanced data distribution of a table?
If so, motion is needed. If not, the new plan might have better performance.

Oct 20 '20 10:10 lij55

I think we need to figure out one question before going deep into this plan difference.

Should random distribution guarantee a balanced data distribution of a table? If so, motion is needed. If not, the new plan might have better performance.

I believe random motion here guarantee a balanced data distribution of a table because it computes target segment using random() function for each tuple.

Oct 20 '20 10:10 kainwen-zz

I believe random motion here guarantee a balanced data distribution of a table

and it looks like that the plan of master doesn't have the random motion yet...

Oct 20 '20 10:10 lij55

The 6X documentation refers to "randomly" and "round-robin fashion" interchangeably. My opinion is distributed randomly means no assumptions can be made about underlying data; motion node shouldn't be required when routing tuples into a randomly distributed table.

Why should the behavior on Greenplum 7 be consistent with Greenplum 6?

Mar 04 '21 07:03 asimrp

I think the point here is that both "round-robin" or "no assumption about data" are valid explanation of "randomly", and has their own use cases. We can't say one is clearly better than the other. If a user upgrade from previous version to 7, and demand the round-robin way, would it be considered a regression? If we change our official explanation of "randomly" in documentation, how should we handle the old use case?

Mar 05 '21 02:03 pf-qiu

with @ashwinstar 7x is more efficient in this case we should capture the behaviour change in docs @dyozie can you take care of this

Dec 23 '22 01:12 dpandhi-git

I think we can just close this issue. For me doc change is not needed.

Feb 14 '23 09:02 kainwen

gpdb gpdb copied to clipboard

Extra motion node when insert into randomly distributed table from hash distributed table on 6X

Greenplum version or build

OS version and uname -a

autoconf options used ( config.status --config )

Step to reproduce the behavior

Expected behavior

Actual behavior

gpdb
gpdb copied to clipboard