An incorrect query result, where the distributed query plan seems wrong
DROP TABLE IF EXISTS t1;
create table t1 (
vkey int4 ,
c10 int4
);
DROP TABLE IF EXISTS t3;
create table t3 (
vkey int4
);
insert into t3 (vkey) values (1);
insert into t1 (vkey,c10) values (4, -70);
I made t1 a distributed table:
SELECT create_distributed_table('t1', 'vkey');
The query:
select t3.vkey
from (t1 right outer join t3
on (t1.c10 = t3.vkey ))
where exists (select * from t3);
The result should be t3.vkey = 1, but the query actually returns nothing. The Citus version is 12.5.
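For reference, the same query against plain (non-distributed) Postgres tables behaves as expected; a minimal sanity check, with the output reasoned from the inserted rows rather than captured from a run:

-- With no create_distributed_table() call: t1 holds (4, -70) and t3
-- holds (1). No row satisfies t1.c10 = t3.vkey, so the right outer
-- join preserves t3's row with NULLs for t1's columns, and the EXISTS
-- clause is trivially true since t3 is non-empty.
select t3.vkey
from (t1 right outer join t3
on (t1.c10 = t3.vkey ))
where exists (select * from t3);
--  vkey
-- ------
--     1
-- (1 row)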
The distributed query plan of this query:

Custom Scan (Citus Adaptive)  (cost=0.00..0.00 rows=100000 width=4)
  ->  Distributed Subplan 1_1
        ->  Seq Scan on t3  (cost=0.00..35.50 rows=2550 width=4)
  ->  Distributed Subplan 1_2
        ->  Seq Scan on t3  (cost=0.00..35.50 rows=2550 width=4)
  Task Count: 0
  Tasks Shown: All
The task count should not be 0.
The actual problem is that we directly push down this query, which is wrong:
set client_min_messages to debug3;
select t3.vkey
from (t1 right outer join t3
on (t1.c10 = t3.vkey ))
where exists (select * from t3);
DEBUG: no shard pruning constraints on t1 found
DEBUG: shard count after pruning for t1: 32
DEBUG: Router planner cannot handle multi-shard select queries
DEBUG: no shard pruning constraints on t1 found
DEBUG: shard count after pruning for t1: 32
DEBUG: assigned task 1 to node localhost:9701
DEBUG: assigned task 2 to node localhost:9702
DEBUG: assigned task 3 to node localhost:9701
DEBUG: assigned task 4 to node localhost:9702
..
DEBUG: assigned task 32 to node localhost:9702
So, for such a query, where t3 is a Postgres local table (which, like a reference table, we call a recurring rel) and t1 is a distributed table, and where the recurring relation is on the outer side of the outer join and the distributed one is on the inner side, we should normally go through recursive planning and first recursively plan the distributed rel. See this:
/*
* Similarly, logical planner cannot handle outer joins when the outer rel
* is recurring, such as "<recurring> LEFT JOIN <distributed>". In that case,
* we convert distributed table into a subquery and recursively plan inner
* side of the outer join. That way, inner rel gets converted into an intermediate
* result and logical planner can handle the new query since it's of the form
* "<recurring> LEFT JOIN <recurring>".
*/
if (ShouldRecursivelyPlanOuterJoins(context))
{
RecursivelyPlanRecurringTupleOuterJoinWalker((Node *) query->jointree,
query, context);
}
Otherwise, i.e., when we push down this query, it returns an incorrect result set, because tuples that don't exist in the shards of the distributed table would appear in each outer join pushed down to the shards. In the past, because pushing down such a query would return an incorrect result set, we threw an error. But as of #6512, we support such queries by recursively planning the distributed table first; in other words, we first convert it into another recurring relation form, which we call an "intermediate result". Although this is not ideal from a performance perspective, it at least allows us to support such queries correctly. Having said that, this issue doesn't seem specific to local tables; it can be generalized to set-returning functions, reference tables, etc. too.
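To make the correctness hazard concrete, here is a sketch of the per-shard task that naive pushdown would run; the shard name t1_102008 is made up for illustration:

-- Hypothetical task for one of the 32 shards. t3 is recurring, so every
-- task sees all of its rows, while each task sees only a slice of t1.
select t3.vkey
from (t1_102008 right outer join t3
on (t1_102008.c10 = t3.vkey ))
where exists (select * from t3);
-- No shard of t1 contains a row matching t3.vkey = 1, so every task
-- preserves t3's row; concatenating the 32 task results would return
-- the row 32 times instead of once.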
One thing that looks interesting is that when I remove the where clause, we actually do what we're supposed to do (here t2 is another recurring rel, per the debug output below):
set client_min_messages to debug2;
select * from (t2 full outer join t1 on(t2.vkey = t1.vkey ));
DEBUG: Router planner cannot handle multi-shard select queries
**DEBUG: recursively planning right side of the full join since the other side is a recurring rel**
**DEBUG: recursively planning distributed relation "t1" since it is part of a distributed join node that is outer joined with a recurring rel**
DEBUG: Wrapping relation "t1" to a subquery
DEBUG: Router planner cannot handle multi-shard select queries
DEBUG: generating subplan 10_1 for subquery SELECT vkey FROM public.t1 WHERE true
DEBUG: Plan 10 query after replacing subqueries and CTEs: SELECT t2.vkey, t1.vkey FROM (public.t2 FULL JOIN (SELECT t1_1.vkey FROM (SELECT intermediate_result.vkey FROM read_intermediate_result('10_1'::text, 'binary'::citus_copy_format) intermediate_result(vkey integer)) t1_1) t1 ON ((t2.vkey OPERATOR(pg_catalog.=) t1.vkey)))
DEBUG: Creating router plan
┌──────┬──────┐
│ vkey │ vkey │
├──────┼──────┤
│    5 │      │
└──────┴──────┘
(1 row)
In other words, in that case, we first convert the distributed table into an intermediate result and everything goes well. However, when the where clause mentioned above is introduced, ShouldRecursivelyPlanOuterJoins() returns false, so we never call the RecursivelyPlanRecurringTupleOuterJoinWalker() function, which is what converts distributed tables into intermediate results.
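Presumably, going the other way reproduces the bug: reintroducing the same kind of pseudoconstant filter into the working example should make ShouldRecursivelyPlanOuterJoins() return false again. An untested sketch of such a regressing variant:

-- Same full outer join as above, plus a pseudoconstant WHERE clause;
-- per the analysis above, this should skip the recursive planning of t1.
select * from (t2 full outer join t1 on (t2.vkey = t1.vkey))
where exists (select * from t3);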
So, either standard_planner() or an earlier part of the Citus planner seems to convert the outer join mentioned above into some other form, and that causes us to plan the query incorrectly.
@codeforall and I will debug this a bit to see what's happening here.
(cc: @colm-mchugh)
@onurctirtir this problem (and both #7697 and #7696) does not occur with Citus 13. On investigation, the problem is with Postgres 16, because of pg commit 695f5deb79, which prevents set_join_pathlist_hook from being called if any of the join restrictions is a pseudoconstant. In #7698 on Citus 12, Postgres 16 does not call the join hook because the right outer join fails the consider_join_pushdown check in add_paths_to_joinrel(), so Citus has no info on the join, never sees that the query has an outer join, and ends up producing an incorrect plan.

Pg commit 9e9931d re-enables unconditional calling of set_join_pathlist_hook and is in Postgres 17, so this problem (and #7697 and #7696) does not occur with Citus 13, because Citus has visibility on all the joins and can make the decisions you describe in your post. So it seems to be a somewhat unfortunate case of Citus 12 being impacted by a restriction on the calling of set_join_pathlist_hook in Postgres 16. Getting the pg commit (sha 9e9931d) that removed the restriction into Postgres 16 would have prevented this, but it was not to be.
cc @codeforall, sharing this update in case it's useful to your PR.
This issue can be fixed by upgrading to Postgres 17.
For Postgres 15 and 16, there is no access to set_join_pathlist_hook, so to prevent incorrect query results, the next version of Citus will throw an error for this type of query with pseudoconstant conditions (like exists (select * from t3)).
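For context, a pseudoconstant condition is one that references no columns of the joined relations, so Postgres evaluates it once as a gating qual instead of once per row. Two illustrative examples; the second filter is made up for illustration:

-- Pseudoconstant: the EXISTS subquery references nothing from the join.
select t3.vkey
from (t1 right outer join t3 on (t1.c10 = t3.vkey))
where exists (select * from t3);

-- Also pseudoconstant: a stable function call over no join columns.
select t3.vkey
from (t1 right outer join t3 on (t1.c10 = t3.vkey))
where current_setting('server_version_num')::int >= 160000;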