hydroflow
hydroflow copied to clipboard
Decomposing Join
flowchart TD
optimize["auto-optimize (rewrite) <tt>persist()/deltae()</tt> #347"]
joinstate["split up <tt>join()</tt> state into <tt>persist()</tt>, etc. #347"]
joinlattice["lattice types in <tt>join()</tt> (~#271)"]
joinopt["auto-optimize <tt>join()</tt> state"]
optimize --> joinopt
joinstate --> joinlattice --> joinopt
%% david["David dedalus optimizations?"]
%% rewrite_api --> david
%% optimize --> david
I think this issue hits at the overall question of like what are our edges? Currently they are essentially streams, they don't do deduping, they don't do re-ordering. So that means what does group_by/reduce look like? Currently because edges are just streams then it works with almost any closure you put in, but if we want to treat edges like sets or like lattices or like some other structure then that would greatly affect the implementation of join/groupby/reduce/many other operators.
Here's a brief summary of one take that definitely requires more discussion: edge types should always be streams by default and lattice types should fall out of applying operators.
For example, putting a stream through a shuffle()
operator would produce a new stream, but now effectively with bag semantics. If you have a downstream reduce
function that is known to have an associative + commutative function, then we know a safe rewrite can transform that to shuffle() -> fold_ac(...)
, and the shuffle can then be pushed through other operators and used to perform the aggregation with partitions, etc.
On the types side, we can focus on checking for determinism. For example, it would be a compile error if a shuffle()
d stream goes into a fold
(it must go into a fold_c
).
See #929 -- same issue
#1050 #1058
Closing this as we have lattice_bimorphism()
operator - can make separate issues for improving that