differential-datalog
differential-datalog copied to clipboard
Support left joins
Left joins are currently implemented using antijoins. Differential dataflow provides a better way.
Any progress here? cc @frankmcsherry
Nope.
I'm not sure I have all the context on the question, but you should always be able to implement a left join using inner joins and potentially either antijoins or reductions. If you can identify keys in one relation without matching keys in another relation (non-monotonic, unfortunately) then you can perform the inner join, and pad non-matching keys with the default values.
But, as a non-monotonic operation it is not natively implementable in Datalog, I believe (edit: without stratification).
DDlog does support stratified negation, so we can express left joins, but this issue was about implementing them more efficiently when possible (e.g., when a left join is followed by aggregation). This would require either implementing some form of static analysis to identify the pattern or exposing a new language primitive.
I vote for the static analysis. No point in making the language more complex than needed.
At Materialize we do this with a pretty simple static analysis that goes down from reductions (or filters sometimes) and has the form "records with NULL values in any of these fields can be dropped without affecting the results". When this gets to the union for a left join it descends both branches and .. has no effect on the inner join, but is able to zero out the half that produces NULL values for missing keys.
This would mean that you would need to have a notion of NULL (or a more sane default value; perhaps zero) and I wouldn't wish that on anyone.
That makes sense, thanks! Part of the problem is that DDlog does not have an explicit left join operation, so we would have to either introduce one or have an even smarter analysis that deduces that a fragment of the program behaves as a left join. I prefer the former.
The SQL to DDlog compiler can recognize left joins more easily. And we have a clear convention for representing NULLs too.
But that's a different problem.
Just to confirm, right now we have to make two intermediate relations with antijoin and inner join and then combine them into our final relation (or use them directly)? Or is there a more terse syntax that is possible in the meantime.
That's correct, you need two rules, there is no shortcut to do this more concisely.