gpdb
gpdb copied to clipboard
Extra motion node when insert into randomly distributed table from hash distributed table on 6X
This is a behavior mismatch between 6X and master. Whether the extra motion node is desirable is debatable but it should be same for 6X and master.
Greenplum version or build
PostgreSQL 12beta2 (Greenplum Database 7.0.0-alpha.0+4afa059 build dev) PostgreSQL 9.4.24 (Greenplum Database 6.11.2+dev.31.gfbc0a7be4c build dev)
OS version and uname -a
Ubuntu 20.04 Linux dev 5.4.0-51-generic #56-Ubuntu SMP Mon Oct 5 14:28:49 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
autoconf options used ( config.status --config )
'--disable-orca' '--disable-pxf' '--disable-gpcloud' '--enable-debug'
Step to reproduce the behavior
CREATE TABLE t1(a int) DISTRIBUTED RANDOMLY; CREATE TABLE t2(a int) DISTRIBUTED BY(a); EXPLAIN INSERT INTO t1 SELECT a FROM t2;
Expected behavior
Plans generated by master and 6X should be the same.
Actual behavior
6X plan has an extra motion node. 6X:
QUERY PLAN
-----------------------------------------------------------------------------------------------
Insert on t1 (cost=0.00..1063.00 rows=32100 width=4)
-> Redistribute Motion 3:3 (slice1; segments: 3) (cost=0.00..1063.00 rows=32100 width=4)
-> Seq Scan on t2 (cost=0.00..1063.00 rows=32100 width=4)
Optimizer: Postgres query optimizer
(4 rows)
master:
QUERY PLAN
--------------------------------------------------------------
Insert on t1 (cost=0.00..355.00 rows=32100 width=4)
-> Seq Scan on t2 (cost=0.00..355.00 rows=32100 width=4)
Optimizer: Postgres query optimizer
(3 rows)
I think we need to figure out one question before going deep into this plan difference.
Should random distribution guarantee a balanced data distribution of a table?
If so, motion is needed. If not, the new plan might have better performance.
I think we need to figure out one question before going deep into this plan difference.
Should
randomdistribution guarantee a balanced data distribution of a table? If so, motion is needed. If not, the new plan might have better performance.
I believe random motion here guarantee a balanced data distribution of a table because it computes target segment using random() function for each tuple.
I believe random motion here guarantee a balanced data distribution of a table
and it looks like that the plan of master doesn't have the random motion yet...
The 6X documentation refers to "randomly" and "round-robin fashion" interchangeably. My opinion is distributed randomly means no assumptions can be made about underlying data; motion node shouldn't be required when routing tuples into a randomly distributed table.
Why should the behavior on Greenplum 7 be consistent with Greenplum 6?
I think the point here is that both "round-robin" or "no assumption about data" are valid explanation of "randomly", and has their own use cases. We can't say one is clearly better than the other. If a user upgrade from previous version to 7, and demand the round-robin way, would it be considered a regression? If we change our official explanation of "randomly" in documentation, how should we handle the old use case?
with @ashwinstar 7x is more efficient in this case we should capture the behaviour change in docs @dyozie can you take care of this
I think we can just close this issue. For me doc change is not needed.