materialize icon indicating copy to clipboard operation
materialize copied to clipboard

Split OR Filters to UNIONs to reduce size of cross joins

Open wangandi opened this issue 5 years ago • 3 comments

From a prospect query:

select foo.b
from foo, bar 
where foo.a = 'X'
  or foo.a = bar.a
  or bar.a = 'Y'

It plans something like

 %0 =                                                  +
 | Get materialize.public.foo (u1)                 +
 | ArrangeBy ()                                        +
                                                       +
 %1 =                                                  +
 | Get materialize.public.bar (u1)                 +
                                                       +
 %2 =                                                  +
 | Join %0 %1                                          +
 | | implementation = Differential %1 %0.()            +
 | | demand = (#0..#2)                            +
 | Filter (((#2 = "Y") || (#0 = #2)) || (#0 = "X"))+
 | Project(#1)

This will likely be less memory intensive if converted to

select foo.b
from foo, bar 
where foo.a = 'X'
UNION ALL
select foo.b
from foo, bar 
where foo.a = bar.a
UNION ALL
select foo.b
from foo, bar 
bar.a = 'Y'

wangandi avatar Sep 16 '20 17:09 wangandi

I just filed (and closed) a basically identical issue. The only thing worth carrying over is that for correctness it is important to include the complement of the preceding disjunctions, so for example

select foo.b
from foo, bar 
where foo.a = 'X'
UNION ALL
select foo.b
from foo, bar 
where foo.a = bar.a 
  AND NOT foo.a = 'X'    -- important
UNION ALL
select foo.b
from foo, bar 
where bar.a = 'Y'
  AND NOT foo.a = 'X'    -- important
  AND NOT foo.a = bar.a  -- important

There are plenty of cases where these could be optimized away, including possibly this case idk, but generally we want to avoid double counting and complementing prior disjunctions is one way to do that.

frankmcsherry avatar Aug 05 '21 15:08 frankmcsherry

Also mentioned in https://github.com/MaterializeInc/materialize/issues/7409

ggevay avatar Sep 08 '22 17:09 ggevay

This issue is also seen in TPC-H Query 16.

wangandi avatar Sep 19 '22 18:09 wangandi

One more instance here in a user query: https://materializeinc.slack.com/archives/C02PPB50ZHS/p1675867461671029

ggevay avatar Feb 09 '23 14:02 ggevay

(Might also help here: https://github.com/MaterializeInc/materialize/issues/19437, but it enables pushdowns into sources instead of eliminating cross joins.)

ggevay avatar May 24 '23 12:05 ggevay

One more situation where the decomposition of an OR in a MirScalarExpr expr to a UNION in MirRelationExpr would help (came from @sthm):

The following doesn't compile, because mz_now is in the SELECT:

CREATE VIEW foo AS
SELECT
	...

	-- record that the end time needs adjustment in the final query
	CASE WHEN
        pr_end IS NULL
        AND mz_now() < sh_end
        THEN 1
	ELSE 0
     END 
AS shifted_end

FROM shs, g_logs
WHERE ...

We could automatically decompose the CASE into a UNION, but then there is another problem: This comes out to

SELECT ... 1 FROM ... WHERE prod_end IS NULL AND mz_now() < shift_end
UNION
SELECT ... 0 FROM ... WHERE not(prod_end IS NULL AND mz_now() < shift_end)

where mz_now is inside an OR (after we DeMorgan the not(... and ...)), which is again not allowed. However, we could further decompose: this time the OR would be made into a UNION.

ggevay avatar Jun 01 '23 10:06 ggevay

This would also help the NOT IN issue (https://github.com/MaterializeInc/materialize/issues/1137).

ggevay avatar Sep 23 '23 10:09 ggevay