Be smarter about n-way joins
We currently implement N-way joins as a chain of two-way joins. While this does help somewhat with re-use, we could be smarter about it. For example:
A JOIN B JOIN C JOIN D
will currently be turned into a chain of
((A JOIN B) JOIN C) JOIN D
whereas we could turn it into
(A JOIN B) JOIN (C JOIN D)
which allows more parallelism in the data-flow. We could also take into account the sizes of the tables involved (and even key density) to determine which join order allows us to do the least work.
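To make the "less work" point concrete, here is a toy sketch of that idea. The cost model (each join charged its input sizes plus its output size), the selectivity constant, and the cardinalities are all made-up assumptions, not anything from Noria's planner:

```rust
// Toy cost model: charge each two-way join its input sizes plus its
// output size, estimating the output as |L| * |R| * selectivity.
fn join_cost(l: f64, r: f64, sel: f64) -> (f64, f64) {
    let out = l * r * sel;
    (l + r + out, out) // (cost of this join, estimated output size)
}

fn main() {
    // Hypothetical cardinalities for A, B, C, D and a uniform selectivity.
    let (a, b, c, d) = (1_000.0, 1_000.0, 1_000.0, 1_000.0);
    let sel = 0.01;

    // Left-deep chain: ((A JOIN B) JOIN C) JOIN D
    let (c1, ab) = join_cost(a, b, sel);
    let (c2, abc) = join_cost(ab, c, sel);
    let (c3, _) = join_cost(abc, d, sel);
    println!("left-deep: {}", c1 + c2 + c3); // 1,224,000 under this model

    // Bushy tree: (A JOIN B) JOIN (C JOIN D). Its intermediates are
    // smaller here, and the two inner joins are independent, so the
    // dataflow could also run them in parallel.
    let (c1, ab) = join_cost(a, b, sel);
    let (c2, cd) = join_cost(c, d, sel);
    let (c3, _) = join_cost(ab, cd, sel);
    println!("bushy:     {}", c1 + c2 + c3); // 1,044,000 under this model
}
```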
This work could extend to adding direct support for an N-way join, as one may be possible to execute more efficiently than a long chain of 2-way joins. We may want to keep the N-way join as a separate operator though, as its code will likely be more complicated, and slower, than the 2-way join code.
Relevant implementations of worst-case optimal generic n-way joins:
https://github.com/frankmcsherry/dataflow-join
https://github.com/HazyResearch/EmptyHeaded
I don't know how relevant this still is, or if this is the right place, but there is some research from the data management community into worst-case optimal join algorithms that might help you out here.
Here's a general overview paper on the topic: https://dl.acm.org/doi/pdf/10.1145/2590989.2590991
Here's an example of a worst-case optimal join algorithm that seems particularly well suited to Rust, given that it relies heavily on iterator interfaces between everything: Leapfrog Triejoin: https://openproceedings.org/2014/conf/icdt/Veldhuizen14.pdf
I'm not sure how well this would apply to dataflows, but it may be worth looking into.
edit: I just saw the other two links above, and Generic-Join is also worst-case optimal.
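To make the iterator point concrete, here is a minimal sketch of LTJ's core "leapfrog" step for a single attribute: intersecting k sorted, deduplicated key columns using nothing but "seek to first key >= x" cursor moves. Binary search stands in for the trie iterator's seek(), and all names are illustrative, not taken from the paper or from Noria:

```rust
/// Intersect k sorted, deduplicated key columns a la the leapfrog step
/// of Leapfrog Triejoin, using only seek-style cursor moves.
fn leapfrog_intersect(cols: &[&[u64]]) -> Vec<u64> {
    let k = cols.len();
    let mut out = Vec::new();
    if k == 0 || cols.iter().any(|c| c.is_empty()) {
        return out;
    }
    let mut pos = vec![0usize; k]; // one cursor per column
    let mut hi = cols.iter().map(|c| c[0]).max().unwrap(); // largest first key
    let mut matched = 0; // how many consecutive cursors sit on `hi`
    let mut i = 0;
    loop {
        let col = cols[i];
        // seek(): jump cursor i to the first key >= hi
        pos[i] += match col[pos[i]..].binary_search(&hi) {
            Ok(off) | Err(off) => off,
        };
        if pos[i] == col.len() {
            return out; // this column is exhausted; no further matches
        }
        let key = col[pos[i]];
        if key == hi {
            matched += 1;
            if matched >= k {
                // every cursor agrees: emit, then step past the match
                out.push(hi);
                pos[i] += 1;
                if pos[i] == col.len() {
                    return out;
                }
                hi = col[pos[i]];
                matched = 1;
            }
        } else {
            // the "leapfrog": this key becomes the new bar to clear
            hi = key;
            matched = 1;
        }
        i = (i + 1) % k;
    }
}

fn main() {
    let (a, b, c) = ([1u64, 2, 3, 5, 8], [2u64, 3, 5, 7], [3u64, 5, 9]);
    assert_eq!(leapfrog_intersect(&[&a[..], &b[..], &c[..]]), vec![3, 5]);
}
```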
The biggest challenge we have is that the join algorithm has to be incremental. Since it's a dataflow system, only certain subsets of the more general join problem really apply. The join is given one batch of rows from a single "side" of the join, and must compute the fully joined results through queries into the other sides of the join based only on that batch (and statistics it may have collected). Also, everything before the join has already been computed, so algorithms that try to be cleverer about how they access base tables and such do not really apply. The join can mainly just choose the order in which it queries its ancestors. Which may mean there aren't that many interesting decisions to make, I'm not sure?
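As a concrete illustration of that constraint (with made-up types and names, not Noria's actual operator API), the sketch below receives a batch of new rows from the left ancestor and can only produce the joined delta by probing the other side's already-materialized state on the join key:

```rust
use std::collections::HashMap;

type Row = Vec<i64>;

struct SideState {
    by_key: HashMap<i64, Vec<Row>>, // materialized rows, indexed by join key
}

/// Handle a batch of inserts arriving from the left ancestor.
fn on_left_batch(batch: &[Row], key_col: usize, right: &SideState) -> Vec<Row> {
    let mut out = Vec::new();
    for l in batch {
        // The operator's only real freedom is which ancestor to probe, and
        // in what order; with two inputs there is just one side to consult.
        if let Some(rs) = right.by_key.get(&l[key_col]) {
            for r in rs {
                let mut joined = l.clone();
                joined.extend(r.iter().cloned());
                out.push(joined);
            }
        }
    }
    out
}

fn main() {
    let mut by_key = HashMap::new();
    by_key.insert(1, vec![vec![1, 100], vec![1, 101]]);
    let right = SideState { by_key };
    let delta = vec![vec![1, 10], vec![2, 20]]; // batch from the left side
    // [2, 20] finds no match; [1, 10] joins with both stored rows
    println!("{:?}", on_left_batch(&delta, 0, &right));
}
```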
The main thing that both Generic-Join and Leapfrog Triejoin (LTJ) do well is not joining pairwise (i.e. fully exploiting the fact that there's a multiway join happening) and optimally avoiding skew. Neither really relies on being able to reorder previous operators or anything like that.
At least, Generic-Join doesn't rely on anything fancy pertaining to table access. LTJ, though, does require a B-tree or similar index on each table, creating and removing an iterator on each table, and building a trie of all the tuples, which may or may not be a big hassle.
So yeah, at face value I think the earlier-linked dataflow-join (https://github.com/frankmcsherry/dataflow-join), which is based on Generic-Join, might be the best option, given that it requires less intrusive changes while still providing the benefit of being worst-case optimal.
The fact that there's only a subset of the join problem to be solved is good, of course, but it doesn't take away that some of the tables this subset is getting joined against may be heavily skewed, or that the query may be cyclic.
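For what that skew-avoidance could look like in the incremental setting, here is a hypothetical sketch of a Generic-Join-style ordering heuristic: extend each delta row through the ancestor with the fewest matching rows first, so an empty side short-circuits early and a hot key on one side doesn't inflate the intermediates. All types and names are invented for illustration:

```rust
use std::collections::HashMap;

type Row = Vec<i64>;

struct Ancestor {
    by_key: HashMap<i64, Vec<Row>>, // materialized state, indexed by join key
}

impl Ancestor {
    fn matches(&self, key: i64) -> &[Row] {
        self.by_key.get(&key).map(Vec::as_slice).unwrap_or(&[])
    }
}

/// Join one delta row against several ancestors that share a join key.
fn join_delta_row(delta: &Row, key: i64, ancestors: &[&Ancestor]) -> Vec<Row> {
    // Probe the cheapest ancestor first: an empty match set bails out
    // before we ever touch a skewed (hot) side, and the sum of the
    // intermediate result sizes is minimized.
    let mut order: Vec<&Ancestor> = ancestors.to_vec();
    order.sort_by_key(|a| a.matches(key).len());

    let mut acc = vec![delta.clone()];
    for anc in order {
        let m = anc.matches(key);
        if m.is_empty() {
            return Vec::new(); // some ancestor has no match: no output
        }
        let mut next = Vec::with_capacity(acc.len() * m.len());
        for partial in &acc {
            for r in m {
                let mut row = partial.clone();
                row.extend(r.iter().cloned());
                next.push(row);
            }
        }
        acc = next;
    }
    acc
}

fn main() {
    let mk = |rows: Vec<Row>| {
        let mut by_key = HashMap::new();
        by_key.insert(7, rows);
        Ancestor { by_key }
    };
    let hot = mk((0..1_000i64).map(|i| vec![7, i]).collect()); // skewed side
    let cold = mk(vec![vec![7, -1]]); // one matching row
    // `cold` is probed first, so the intermediate stays at one row.
    assert_eq!(join_delta_row(&vec![7, 42], 7, &[&hot, &cold]).len(), 1_000);
}
```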