blazingsql icon indicating copy to clipboard operation
blazingsql copied to clipboard

Implement inequality joins

Open wmalpica opened this issue 4 years ago • 3 comments

This feature is waiting on this: https://github.com/rapidsai/cudf/issues/2792
https://github.com/rapidsai/cudf/pull/3628

wmalpica avatar May 26 '20 22:05 wmalpica

If we want inequality joins we may not be able to wait for these primitives to exist on the CUDF side. One of the issues above is closed and the other is just a request that has been open for a long time.

In the Simplicity engine (very old blazingdb) we had an approach that required significant materialization but that scaled relatively well on a single node. We stable sorted the data according to the columns that were part of the inequality joins, created an RLE representation of the larger table, then performed bounded look ups from the smaller table into the larger table to get something like a series of ranges per row that you were joined to. So for each row you end up with something like 0-5, 100-112, 300-340. If there were any equality joins you perform those first to reduce the number of rows to be analyzed during the inequality phase.

Another option that has been floated is using an AST to evaluate the join condition but this is somewhat akin to performing a cartesian join ( though not necessarily one that has to be materialized) followed by a filter step which seems like its begging to be improved.

felipeblazing avatar Apr 19 '21 16:04 felipeblazing

I think we will want to do a combination of both approaches. For inequality joins, we will want to do distribution based on a order as opposed to hashes

wmalpica avatar Apr 19 '21 20:04 wmalpica

Im not sure combination is the right way to go here. I think we need to pick hopefully one strategy for the first implemenation. I am currently reviewing literature on the subject to see what else I can find.

felipeblazing avatar Apr 19 '21 20:04 felipeblazing