gpdb
gpdb copied to clipboard
[planner] Consider correlated varnos in ANY subselect.
When correlated Var refer to unavailable rels, in subselect of IN/ANY sublink, convert this sublink to semi join will lead to a wrong result. For example,
EXPLAIN (Costs off) SELECT * FROM
(values (1), (2), (3)) t1(a2)
left join (values (5)) t2(a2)
ON (
t2.a2 in (
SELECT a1 FROM
(values (5,1), (5,2), (5,3)) t3(a1, a2)
WHERE t3.a2 = t1.a2
)
);
QUERY PLAN
----------------------------------------------------------
Hash Right Join
Hash Cond: ("*VALUES*_1".column2 = "*VALUES*".column1)
-> Nested Loop Semi Join
-> Result
-> Values Scan on "*VALUES*_1"
Filter: (5 = column1)
-> Hash
-> Values Scan on "*VALUES*"
Optimizer: Postgres-based planner
(9 rows)
The On-Clause of Right Join refer to the right relation of Semi Join, which obviously will result in a wrong answer, because the right relation of Semi Join will only keep first match one.
Comparing with the EXISTS sublink, EXISTS sublink will be pull up when:
- WhereClause should contain vars of parents level.
- The rest of the subselect should not contain correlated vars.
- All correlated vars should refer to available_rels.
But IN/ANY sublink only check testexpr. This commit will consider subselect as well.
Seems the same fix as https://github.com/greenplum-db/gpdb/pull/16908/ but from a difference perspective.
Seems the same fix as #16908 but from a difference perspective.
Oh, thanks. I did not see this commit before. Emmm, In my opinion, convert sublink to semi join as more as possible is a good way for optimization. So I follow the EXISTS sublink to fix it. But on the other way, as a bug fix ,avoid pulling up all correlated var in IN/ANY subselect also seems a good way.
The rest of the subselect should not contain correlated vars.
@fishtree1161 I think subselect could contain correlated vars, but as you said The On-Clause of Right Join refer to the right relation of Semi Join, which obviously will result in a wrong answer, because the right relation of Semi Join will only keep first match one. which result in wrong because of the semijoin only pop out one tuple(values (5))
For my view, there is something wrong with join-reorder. (A left join B) semijoin C for this current issue we cannot convert it to A left join (B semijoin C)
(t1 LJ t2 on true) semi join t3 on t3.a2 = t1.a2 and t2.a2 = t3.a1 This should be a corrent plan think, but we should spend more time investigating why it happen,and correct me if I'm wrong.
The rest of the subselect should not contain correlated vars.
@fishtree1161 I think subselect could contain correlated vars, but as you said The On-Clause of Right Join refer to the right relation of Semi Join, which obviously will result in a wrong answer, because the right relation of Semi Join will only keep first match one. which result in wrong because of the semijoin only pop out one tuple(values (5))
For my view, there is something wrong with join-reorder. (A left join B) semijoin C for this current issue we cannot convert it to A left join (B semijoin C)
(t1 LJ t2 on true) semi join t3 on t3.a2 = t1.a2 and t2.a2 = t3.a1 This should be a corrent plan think, but we should spend more time investigating why it happen,and correct me if I'm wrong.
@charliettxx If the outer one is inner join, we can swap them, and planner really do it. But it may lead to wrong result between left join and semi join. Because the semi join to C will filter the rows of A, but the rows of "A left join xxx" 's result will never fewer than the A's.
The rest of the subselect should not contain correlated vars.
@fishtree1161 I think subselect could contain correlated vars, but as you said The On-Clause of Right Join refer to the right relation of Semi Join, which obviously will result in a wrong answer, because the right relation of Semi Join will only keep first match one. which result in wrong because of the semijoin only pop out one tuple(values (5))
For my view, there is something wrong with join-reorder. (A left join B) semijoin C for this current issue we cannot convert it to A left join (B semijoin C)
(t1 LJ t2 on true) semi join t3 on t3.a2 = t1.a2 and t2.a2 = t3.a1 This should be a corrent plan think, but we should spend more time investigating why it happen,and correct me if I'm wrong.
@charliettxx And I also think subselect could contain correlated vars, I just forbid the rest space with out whereClause contain correlated vars, and all correlated vars should refer to available_rels. Just like what EXISTS did.
whereClause contain correlated vars, and all correlated vars should refer to available_rels. Just like what EXISTS did.
Got it, I understand your idea. I list some scenarios after this patch and it's better to add them into regression test.
create table t1(a int, b int, c int);
create table t2(a int, b int, c int);
create table t3(a int, b int, c int);
create table t4(a int, b int, c int);
supporte case:
1. refer outer right
SELECT * FROM
t1 left join t2
ON (
t2.a in (
SELECT a FROM t3
WHERE t3.b = t2.b
)
);
2. refer levelup > 1 var
select (SELECT t1.a FROM
t1 left join t2
ON (
t2.a in (
SELECT a FROM t3
WHERE t3.b = t4.b
)
)) a
from t4;
upsupport case:(back to subplan)
1. refer outer left
EXPLAIN (Costs off) SELECT * FROM
t1 left join t2
ON (
t2.a in (
SELECT a FROM t3
WHERE t3.b = t1.b
)
);
2. refer join clause
EXPLAIN (Costs off) SELECT * FROM
t1 left join t2
ON (
t2.a in (
SELECT t3.a FROM t3 join t4 on t4.c = t2.c
WHERE t3.b = t2.b
)
);
The patch Looks good to me, please @kainwen @z-wenlin also take a look.
There are some failed test cases.
test subselect_gp ... FAILED 3131 ms (diff 783 ms)
test qp_dropped_cols ... FAILED 37471 ms (diff 2303 ms)
test qp_correlated_query ... FAILED 1766 ms (diff 721 ms)
test gporca ... FAILED 18033 ms (diff 2202 ms)
@z-wenlin @charliettxx I‘ve add & fix these regression test, please check it again. During fixing these tests, I found that correlated vars in targetlist (select) of sublink may not affect, we can convert them (exists do not bother the targetlist). What do you think about it.
I found that correlated vars in targetlist (select) of sublink may not affect, we can convert them (exists do not bother the targetlist). What do you think about it.
we also found that, yes, targetlist and whereclause in subselect could be converted, we can do that.
@charliettxx I've changed the code & tests, and now the algorithm becomes:
- The testexpr & targetlist & whereClause can contain the correlated vars but should refer to available_rels.
- The rest of sublink should not contain the correlated vars of parent level.
- Higher level vars will not affect current level conver.
targetlist should refer to available_rels, because the case like "true in (select xxx=xxx from ...)" for example: SELECT * FROM (values (1), (2), (3)) t1(a2) left join (values (5)) t2(a2) ON ( (t2.a2, true) in ( SELECT t3.a1, t3.a2 = t1.a2 FROM (values (5,1), (5,2), (5,3)) t3(a1, a2) ) );
https://dev.ci.gpdb.pivotal.io/teams/main/pipelines/gpdb-dev-left-on-semi-in_c-rocky8
https://dev.ci.gpdb.pivotal.io/teams/main/pipelines/gpdb-dev-left-on-semi-in-rocky8
@charliettxx Sorry, I see nothing but only Concourse CI page structure. Is there any tests failed about this PR? What should I do to see the result? Or can you figure out which tests I should retry.
I see nothing but only Concourse CI page structure. Is there any tests failed about this PR?
maybe you didn't have permission to check pipeline results, I'll paste here if anything else failed and we can discussed it.
Thanks so much for your findings and discussion here.
I think this is introduced by commit https://github.com/greenplum-db/gpdb/commit/efb2777a84b255fa7bd3f82d7c532fc7caa7ddcd which remove the line of code
if (contain_vars_of_level((Node *) subselect, 1))
return NULL;
The code change is there since GP6 (means GP4, GP5 planner don't do the work).
The first feeling is that removal of this is a mistake.
For correlated any sublink handling, we can rewrite the SQL to use exists sublink. Sometimes pull up to convert to a nestloop lateral join might miss the choice of hashed subplan.
Will
- try to ping QPA team and get some feedback
- try to find some material on this part
@kainwen Thanks for your attention, and I agree with most of your opinions. Here are some of my suggestions, I hope they will be helpful to you.
- For correlated any sublink handling, IN/ANY sublink + (where testexpr = targetlist) == EXISTS sublink. So convert IN/ANY to EXISTS and treat them in one way may be a better solution. (This PR is aim to treating them in the same way.)
- For uncorrelated sublink, let it be a subplan may just be ok. But for correlated one, pull up to semi join (may be hash or merge or NLJ) is almost better than subplan. Because the correlated subplan is just like the nested loop join -- one row one result.
- For this wrong example, one of key point is that "t3.a2 = t1.a2" clause will be treated as the clause of t1 & t3 after pullup, because of the clause polling during join reordering. I think the better way may be the lateral join clause? Or just don't let this filter pullup over the Semijoin?
- In addition, Correlated vars through the motion !!! The sword of Damocles !!! When talking about the lateral join or correlated subplan, this is the thing let me afraid most. I think this solution may be in executor, which can broadcast correlated var(param) through motion?