project-m36 icon indicating copy to clipboard operation
project-m36 copied to clipboard

Fast three way join algorithm

Open jmatsushita opened this issue 5 years ago • 8 comments

Hi there, I just watched this talk https://youtu.be/FudEqfutW2A?t=1888 from MSFP by @henglein and @mkm about how to use module theory to get a very fast implementation of three way joins which probably has more implications with regards to database performance (with a haskell implementation which I think isn't public yet).

In the q&a section there are questions about backporting this approach to RDMS, and generally make this result more visible to the database community, and I thought this might be an idea that would be a great fit for project-m36.

Hope this is interesting 😄

jmatsushita avatar Sep 04 '20 18:09 jmatsushita

Interesting, indeed! Thanks for the link. I have contacted the authors of the paper for more details.

agentm avatar Sep 05 '20 17:09 agentm

The paper can be found here.

The paper also mentions "Factorized Databases", a tree-based representations of joins which purports to provide an redundancy-free representation.

agentm avatar Sep 06 '20 16:09 agentm

Project:M36 already has a notion similar to factorized joins in nested relations. It may be interesting to implement a special form of join which generates nested relation results.

agentm avatar Sep 06 '20 16:09 agentm

After reviewing the paper and the video, it's not immediately clear but strongly implied that this optimization applies only to the "triangle join" query which finds triples that reference each other circularly. The video and paper also do not conclude whether this optimization applies to n-ary loop searches.

I'm still waiting to hear back from the authors, if only to receive the haskell code used in the benchmark.

agentm avatar Sep 07 '20 02:09 agentm

What I understand is that the triangle join is the only experimental result so far, but that its possible that the intermediary representation would optimise other type of queries. I suspect that it hinges on the representation of negative intermediary values which allow terms to be eliminated without having to compute them, so I conjecture that it would apply to n-ary loops, though like you I havent seen it in the paper (which you will have noted is an extended abstract, so I imagine the authors are still working on it).

jmatsushita avatar Sep 07 '20 05:09 jmatsushita

It's great to see that our work is generating interest! As you have already surmised, it is indeed a work in progress, but I hope to have a publicly available prototype Soon™.

The current epistemic status is as follows:

  • We are sure that certain classes of queries, including cyclic joins, are done optimally.
  • We are confident, and in the process of proving, that all conjunctive queries are done optimally.
  • We speculate that even functional aggregate queries can be done optimally with suitable modifications. Here "optimal" refers to worst-case output optimality.

Note that it's not just a join algorithm, it's a query evaluation approach that happens to deal with join situations (whether explicit joins or cartesian products followed by filtering) efficiently. Of course, you can use it as a special-purpose method to deal with, say, three-way joins, but you won't reap the full benefits and the resulting query evaluator will probably still have some blind spots. Exactly how much of the module-theoretic approach can be grafted onto a relational algebra model remains to be seen, so I'm curious to see how far this goes. Depending on which flavour of relational algebra you are using, you might actually be working with semi-modules already!

mkm avatar Sep 07 '20 12:09 mkm

There is prior workhttps://link.springer.com/article/10.1007/s10990-011-9078-8 using symbolic products and discriminatory joins, predating the module/(tensor) algebra generalization presented by Mikkel and not covering efficient cyclic joins. As it has with full Haskell code (in the article) for all standard SQL query constructs, generalized to non-normal form, I hope it can be of interest. The PDF of the article is included FYC.

Best, Fritz

On 7 Sep 2020, at 14:26, Mikkel Kragh Mathiesen <[email protected]mailto:[email protected]> wrote:

It's great to see that our work is generating interest! As you have already surmised, it is indeed a work in progress, but I hope to have a publicly available prototype Soon™.

The current epistemic status is as follows:

  • We are sure that certain classes of queries, including cyclic joins, are done optimally.
  • We are confident, and in the process of proving, that all conjunctive queries are done optimally.
  • We speculate that even functional aggregate queries can be done optimally with suitable modifications. Here "optimal" refers to worst-case output optimality.

Note that it's not just a join algorithm, it's a query evaluation approach that happens to deal with join situations (whether explicit joins or cartesian products followed by filtering) efficiently. Of course, you can use it as a special-purpose method to deal with, say, three-way joins, but you won't reap the full benefits and the resulting query evaluator will probably still have some blind spots. Exactly how much of the module-theoretic approach can be grafted onto a relational algebra model remains to be seen, so I'm curious to see how far this goes. Depending on which flavour of relational algebra you are using, you might actually be working with semi-modules already!

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://eur02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fagentm%2Fproject-m36%2Fissues%2F265%23issuecomment-688291720&data=02%7C01%7Chenglein%40di.ku.dk%7C3b21e625dfa24e1efa4908d853292f12%7Ca3927f91cda14696af898c9f1ceffa91%7C0%7C0%7C637350783638935588&sdata=vqtffTG783hc1UwT1wraZUpE%2FBXMJ7TTJytorJw0HoE%3D&reserved=0, or unsubscribehttps://eur02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FAAFDZD54M2PT2ULKIUKBDDLSETGNRANCNFSM4QZCBPXA&data=02%7C01%7Chenglein%40di.ku.dk%7C3b21e625dfa24e1efa4908d853292f12%7Ca3927f91cda14696af898c9f1ceffa91%7C0%7C0%7C637350783638935588&sdata=v6PZG8APja6UfrXtH4tflV%2FEuFUv7nQ4VPZSmJOtqJ8%3D&reserved=0.

henglein avatar Sep 11 '20 05:09 henglein

I now have a prototype publicly available at https://github.com/mkm/module-theory. There are examples of three-way joins as well as more complicated joins.

mkm avatar Oct 05 '20 14:10 mkm