Split OR Filters to UNIONs to reduce size of cross joins
From a prospect query:
select foo.b
from foo, bar
where foo.a = 'X'
or foo.a = bar.a
or bar.a = 'Y'
It plans something like
%0 = +
| Get materialize.public.foo (u1) +
| ArrangeBy () +
+
%1 = +
| Get materialize.public.bar (u1) +
+
%2 = +
| Join %0 %1 +
| | implementation = Differential %1 %0.() +
| | demand = (#0..#2) +
| Filter (((#2 = "Y") || (#0 = #2)) || (#0 = "X"))+
| Project(#1)
This will likely be less memory intensive if converted to
select foo.b
from foo, bar
where foo.a = 'X'
UNION ALL
select foo.b
from foo, bar
where foo.a = bar.a
UNION ALL
select foo.b
from foo, bar
bar.a = 'Y'
I just filed (and closed) a basically identical issue. The only thing worth carrying over is that for correctness it is important to include the complement of the preceding disjunctions, so for example
select foo.b
from foo, bar
where foo.a = 'X'
UNION ALL
select foo.b
from foo, bar
where foo.a = bar.a
AND NOT foo.a = 'X' -- important
UNION ALL
select foo.b
from foo, bar
where bar.a = 'Y'
AND NOT foo.a = 'X' -- important
AND NOT foo.a = bar.a -- important
There are plenty of cases where these could be optimized away, including possibly this case idk, but generally we want to avoid double counting and complementing prior disjunctions is one way to do that.
Also mentioned in https://github.com/MaterializeInc/materialize/issues/7409
This issue is also seen in TPC-H Query 16.
One more instance here in a user query: https://materializeinc.slack.com/archives/C02PPB50ZHS/p1675867461671029
(Might also help here: https://github.com/MaterializeInc/materialize/issues/19437, but it enables pushdowns into sources instead of eliminating cross joins.)
One more situation where the decomposition of an OR in a MirScalarExpr expr to a UNION in MirRelationExpr would help (came from @sthm):
The following doesn't compile, because mz_now is in the SELECT:
CREATE VIEW foo AS
SELECT
...
-- record that the end time needs adjustment in the final query
CASE WHEN
pr_end IS NULL
AND mz_now() < sh_end
THEN 1
ELSE 0
END
AS shifted_end
FROM shs, g_logs
WHERE ...
We could automatically decompose the CASE into a UNION, but then there is another problem: This comes out to
SELECT ... 1 FROM ... WHERE prod_end IS NULL AND mz_now() < shift_end
UNION
SELECT ... 0 FROM ... WHERE not(prod_end IS NULL AND mz_now() < shift_end)
where mz_now is inside an OR (after we DeMorgan the not(... and ...)), which is again not allowed. However, we could further decompose: this time the OR would be made into a UNION.
This would also help the NOT IN issue (https://github.com/MaterializeInc/materialize/issues/1137).