hydroflow icon indicating copy to clipboard operation
hydroflow copied to clipboard

Decomposing Join

Open jhellerstein opened this issue 1 year ago • 4 comments

flowchart TD
    optimize["auto-optimize (rewrite) <tt>persist()/deltae()</tt> #347"]
    joinstate["split up <tt>join()</tt> state into <tt>persist()</tt>, etc. #347"]
    joinlattice["lattice types in <tt>join()</tt> (~#271)"]
    joinopt["auto-optimize <tt>join()</tt> state"]

    optimize --> joinopt
    joinstate --> joinlattice --> joinopt

    %% david["David dedalus optimizations?"]
    %% rewrite_api --> david
    %% optimize --> david

jhellerstein avatar May 08 '23 17:05 jhellerstein

I think this issue hits at the overall question of like what are our edges? Currently they are essentially streams, they don't do deduping, they don't do re-ordering. So that means what does group_by/reduce look like? Currently because edges are just streams then it works with almost any closure you put in, but if we want to treat edges like sets or like lattices or like some other structure then that would greatly affect the implementation of join/groupby/reduce/many other operators.

zzlk avatar May 08 '23 17:05 zzlk

Here's a brief summary of one take that definitely requires more discussion: edge types should always be streams by default and lattice types should fall out of applying operators.

For example, putting a stream through a shuffle() operator would produce a new stream, but now effectively with bag semantics. If you have a downstream reduce function that is known to have an associative + commutative function, then we know a safe rewrite can transform that to shuffle() -> fold_ac(...), and the shuffle can then be pushed through other operators and used to perform the aggregation with partitions, etc.

On the types side, we can focus on checking for determinism. For example, it would be a compile error if a shuffle()d stream goes into a fold (it must go into a fold_c).

shadaj avatar May 14 '23 01:05 shadaj

See #929 -- same issue

jhellerstein avatar Nov 20 '23 18:11 jhellerstein

#1050 #1058

MingweiSamuel avatar Mar 26 '24 18:03 MingweiSamuel

Closing this as we have lattice_bimorphism() operator - can make separate issues for improving that

MingweiSamuel avatar Aug 12 '24 16:08 MingweiSamuel