gpdb icon indicating copy to clipboard operation
gpdb copied to clipboard

[planner] Consider correlated varnos in ANY subselect.

Open fishtree1161 opened this issue 1 year ago • 15 comments

When correlated Var refer to unavailable rels, in subselect of IN/ANY sublink, convert this sublink to semi join will lead to a wrong result. For example,

EXPLAIN (Costs off) SELECT * FROM
(values (1), (2), (3)) t1(a2)
left join (values (5)) t2(a2)
ON (
    t2.a2 in (
        SELECT a1 FROM
        (values (5,1), (5,2), (5,3)) t3(a1, a2)
        WHERE t3.a2 = t1.a2
    )
);
                        QUERY PLAN
----------------------------------------------------------
 Hash Right Join
   Hash Cond: ("*VALUES*_1".column2 = "*VALUES*".column1)
   ->  Nested Loop Semi Join
         ->  Result
         ->  Values Scan on "*VALUES*_1"
               Filter: (5 = column1)
   ->  Hash
         ->  Values Scan on "*VALUES*"
 Optimizer: Postgres-based planner
(9 rows)

The On-Clause of Right Join refer to the right relation of Semi Join, which obviously will result in a wrong answer, because the right relation of Semi Join will only keep first match one.

Comparing with the EXISTS sublink, EXISTS sublink will be pull up when:

  1. WhereClause should contain vars of parents level.
  2. The rest of the subselect should not contain correlated vars.
  3. All correlated vars should refer to available_rels.

But IN/ANY sublink only check testexpr. This commit will consider subselect as well.

fishtree1161 avatar Jan 12 '24 06:01 fishtree1161

Seems the same fix as https://github.com/greenplum-db/gpdb/pull/16908/ but from a difference perspective.

kainwen avatar Jan 12 '24 06:01 kainwen

Seems the same fix as #16908 but from a difference perspective.

Oh, thanks. I did not see this commit before. Emmm, In my opinion, convert sublink to semi join as more as possible is a good way for optimization. So I follow the EXISTS sublink to fix it. But on the other way, as a bug fix ,avoid pulling up all correlated var in IN/ANY subselect also seems a good way.

fishtree1161 avatar Jan 12 '24 07:01 fishtree1161

The rest of the subselect should not contain correlated vars.

@fishtree1161 I think subselect could contain correlated vars, but as you said The On-Clause of Right Join refer to the right relation of Semi Join, which obviously will result in a wrong answer, because the right relation of Semi Join will only keep first match one. which result in wrong because of the semijoin only pop out one tuple(values (5))

For my view, there is something wrong with join-reorder. (A left join B) semijoin C for this current issue we cannot convert it to A left join (B semijoin C)

(t1 LJ t2 on true) semi join t3 on t3.a2 = t1.a2 and t2.a2 = t3.a1 This should be a corrent plan think, but we should spend more time investigating why it happen,and correct me if I'm wrong.

charliettxx avatar Jan 17 '24 04:01 charliettxx

The rest of the subselect should not contain correlated vars.

@fishtree1161 I think subselect could contain correlated vars, but as you said The On-Clause of Right Join refer to the right relation of Semi Join, which obviously will result in a wrong answer, because the right relation of Semi Join will only keep first match one. which result in wrong because of the semijoin only pop out one tuple(values (5))

For my view, there is something wrong with join-reorder. (A left join B) semijoin C for this current issue we cannot convert it to A left join (B semijoin C)

(t1 LJ t2 on true) semi join t3 on t3.a2 = t1.a2 and t2.a2 = t3.a1 This should be a corrent plan think, but we should spend more time investigating why it happen,and correct me if I'm wrong.

@charliettxx If the outer one is inner join, we can swap them, and planner really do it. But it may lead to wrong result between left join and semi join. Because the semi join to C will filter the rows of A, but the rows of "A left join xxx" 's result will never fewer than the A's.

fishtree1161 avatar Jan 17 '24 04:01 fishtree1161

The rest of the subselect should not contain correlated vars.

@fishtree1161 I think subselect could contain correlated vars, but as you said The On-Clause of Right Join refer to the right relation of Semi Join, which obviously will result in a wrong answer, because the right relation of Semi Join will only keep first match one. which result in wrong because of the semijoin only pop out one tuple(values (5))

For my view, there is something wrong with join-reorder. (A left join B) semijoin C for this current issue we cannot convert it to A left join (B semijoin C)

(t1 LJ t2 on true) semi join t3 on t3.a2 = t1.a2 and t2.a2 = t3.a1 This should be a corrent plan think, but we should spend more time investigating why it happen,and correct me if I'm wrong.

@charliettxx And I also think subselect could contain correlated vars, I just forbid the rest space with out whereClause contain correlated vars, and all correlated vars should refer to available_rels. Just like what EXISTS did.

fishtree1161 avatar Jan 17 '24 04:01 fishtree1161

whereClause contain correlated vars, and all correlated vars should refer to available_rels. Just like what EXISTS did.

Got it, I understand your idea. I list some scenarios after this patch and it's better to add them into regression test.

create table t1(a int, b int, c int);
create table t2(a int, b int, c int);
create table t3(a int, b int, c int);
create table t4(a int, b int, c int);

supporte case:

1. refer outer right
 
SELECT * FROM
t1 left join t2
ON (
    t2.a in (
        SELECT a FROM t3
        WHERE t3.b = t2.b
    )
);

2. refer levelup > 1 var

select (SELECT t1.a FROM
t1 left join t2
ON (
    t2.a in (
        SELECT a FROM t3
        WHERE t3.b = t4.b
    )
)) a
from t4;

upsupport case:(back to subplan)

1. refer outer left

EXPLAIN (Costs off) SELECT * FROM
t1 left join t2
ON (
    t2.a in (
        SELECT a FROM t3
        WHERE t3.b = t1.b
    )
);

2. refer join clause

EXPLAIN (Costs off) SELECT * FROM
t1 left join t2
ON (
    t2.a in (
        SELECT t3.a FROM t3 join t4 on t4.c = t2.c
        WHERE t3.b = t2.b
    )
);

The patch Looks good to me, please @kainwen @z-wenlin also take a look.

charliettxx avatar Jan 17 '24 13:01 charliettxx

There are some failed test cases.

test subselect_gp                 ... FAILED     3131 ms (diff  783 ms)
test qp_dropped_cols              ... FAILED    37471 ms (diff 2303 ms)
test qp_correlated_query          ... FAILED     1766 ms (diff  721 ms)
test gporca                       ... FAILED    18033 ms (diff 2202 ms)

z-wenlin avatar Jan 18 '24 01:01 z-wenlin

@z-wenlin @charliettxx I‘ve add & fix these regression test, please check it again. During fixing these tests, I found that correlated vars in targetlist (select) of sublink may not affect, we can convert them (exists do not bother the targetlist). What do you think about it.

fishtree1161 avatar Jan 18 '24 04:01 fishtree1161

I found that correlated vars in targetlist (select) of sublink may not affect, we can convert them (exists do not bother the targetlist). What do you think about it.

we also found that, yes, targetlist and whereclause in subselect could be converted, we can do that.

charliettxx avatar Jan 18 '24 07:01 charliettxx

@charliettxx I've changed the code & tests, and now the algorithm becomes:

  1. The testexpr & targetlist & whereClause can contain the correlated vars but should refer to available_rels.
  2. The rest of sublink should not contain the correlated vars of parent level.
  3. Higher level vars will not affect current level conver.

targetlist should refer to available_rels, because the case like "true in (select xxx=xxx from ...)" for example: SELECT * FROM (values (1), (2), (3)) t1(a2) left join (values (5)) t2(a2) ON ( (t2.a2, true) in ( SELECT t3.a1, t3.a2 = t1.a2 FROM (values (5,1), (5,2), (5,3)) t3(a1, a2) ) );

fishtree1161 avatar Jan 18 '24 12:01 fishtree1161

https://dev.ci.gpdb.pivotal.io/teams/main/pipelines/gpdb-dev-left-on-semi-in_c-rocky8

charliettxx avatar Jan 22 '24 06:01 charliettxx

https://dev.ci.gpdb.pivotal.io/teams/main/pipelines/gpdb-dev-left-on-semi-in-rocky8

@charliettxx Sorry, I see nothing but only Concourse CI page structure. Is there any tests failed about this PR? What should I do to see the result? Or can you figure out which tests I should retry.

fishtree1161 avatar Jan 22 '24 06:01 fishtree1161

I see nothing but only Concourse CI page structure. Is there any tests failed about this PR?

maybe you didn't have permission to check pipeline results, I'll paste here if anything else failed and we can discussed it.

charliettxx avatar Jan 22 '24 09:01 charliettxx

Thanks so much for your findings and discussion here.

I think this is introduced by commit https://github.com/greenplum-db/gpdb/commit/efb2777a84b255fa7bd3f82d7c532fc7caa7ddcd which remove the line of code

if (contain_vars_of_level((Node *) subselect, 1))
        return NULL;

The code change is there since GP6 (means GP4, GP5 planner don't do the work).

The first feeling is that removal of this is a mistake.

For correlated any sublink handling, we can rewrite the SQL to use exists sublink. Sometimes pull up to convert to a nestloop lateral join might miss the choice of hashed subplan.

Will

  • try to ping QPA team and get some feedback
  • try to find some material on this part

kainwen avatar Jan 22 '24 14:01 kainwen

@kainwen Thanks for your attention, and I agree with most of your opinions. Here are some of my suggestions, I hope they will be helpful to you.

  1. For correlated any sublink handling, IN/ANY sublink + (where testexpr = targetlist) == EXISTS sublink. So convert IN/ANY to EXISTS and treat them in one way may be a better solution. (This PR is aim to treating them in the same way.)
  2. For uncorrelated sublink, let it be a subplan may just be ok. But for correlated one, pull up to semi join (may be hash or merge or NLJ) is almost better than subplan. Because the correlated subplan is just like the nested loop join -- one row one result.
  3. For this wrong example, one of key point is that "t3.a2 = t1.a2" clause will be treated as the clause of t1 & t3 after pullup, because of the clause polling during join reordering. I think the better way may be the lateral join clause? Or just don't let this filter pullup over the Semijoin?
  4. In addition, Correlated vars through the motion !!! The sword of Damocles !!! When talking about the lateral join or correlated subplan, this is the thing let me afraid most. I think this solution may be in executor, which can broadcast correlated var(param) through motion?

fishtree1161 avatar Jan 22 '24 15:01 fishtree1161