materialize icon indicating copy to clipboard operation
materialize copied to clipboard

Join ordering should favor pushed-down filters

Open ggevay opened this issue 2 years ago • 1 comments

Feature request

From TPC-H, it seems we could improve join ordering by taking into account which inputs are filtered. For example,

  • https://github.com/MaterializeInc/materialize/issues/14501#issuecomment-1230958447
  • https://www.notion.so/materialize/TPC-H-Optimization-and-Dataflow-Performance-d43aa9cb379d46ccaa0be07e83d658db#35444158146b481da2ccf975acdfce44
  • (We could gather further examples here when you see one. @vmarcos, @frankmcsherry)

This is a good combination with https://github.com/MaterializeInc/materialize/issues/14501, as noted by Frank there.

I'm not sure which of our current heuristics this should override (if any). The current heuristics (see join_implementation.rs::Characteristics) are unique_key, key_length, arranged, in this order. Maybe after unique_key? We should try running all TPC-H queries with several variations. See also Notion discussion.

We can also think about the relative importance of different filter types. For example, the order could be: literal equality, literal inequality, other filter (e.g. like as in TPC-H Q09). (Both Q02 and Q03 feature 2 of these 3 categories.)

When implementing this, we should pay attention that a pushed-down filter can be represented in different ways:

  • sometimes already looks pushed down in MIR
  • but sometimes the MIR misleadingly shows it to be after the join, and it will get into initial_closure or join_closures during MIR->LIR lowering.

Update: It turns out that there are two different join orders:

  • Above I was thinking about the orders inside each delta path.
  • There is also an ordering between the delta paths, which is relevant due to that optimization where snapshots (hydration, select queries) are given to only one of the delta paths (because all the others would just throw away the data somewhere inside the paths). See https://github.com/MaterializeInc/materialize/issues/6789

We probably want to have the above heuristic for both of these orders, because we don't want a big mismatch between join performance for snapshots and then the long-running queries.

Open Questions: Currently the user can often influence the join order by just ordering the tables in the from clause herself. This is because our current heuristics often don't favor one order over an other, and in those cases the ordering defaults to the ordering in the from clause. However, if we add one more heuristic, then this possibility of manual control is reduced. Maybe we could have a switch for power users to override our join ordering heuristics by the manual order? How would such a switch look like? Edit: @frankmcsherry is also concerned about this: https://github.com/MaterializeInc/materialize/issues/14501#issuecomment-1236100844

ggevay avatar Aug 30 '22 12:08 ggevay

I'm not sure which of our current heuristics this should override (if any).

I would say "none"! The changes we've been discussing making are to the order of inputs for a delta join, which (other than permuting the output columns) one can freely rearrange.

frankmcsherry avatar Aug 30 '22 19:08 frankmcsherry