blazingsql
blazingsql copied to clipboard
Implement inequality joins
This feature is waiting on this:
https://github.com/rapidsai/cudf/issues/2792
https://github.com/rapidsai/cudf/pull/3628
If we want inequality joins we may not be able to wait for these primitives to exist on the CUDF side. One of the issues above is closed and the other is just a request that has been open for a long time.
In the Simplicity engine (very old blazingdb) we had an approach that required significant materialization but that scaled relatively well on a single node. We stable sorted the data according to the columns that were part of the inequality joins, created an RLE representation of the larger table, then performed bounded look ups from the smaller table into the larger table to get something like a series of ranges per row that you were joined to. So for each row you end up with something like 0-5, 100-112, 300-340. If there were any equality joins you perform those first to reduce the number of rows to be analyzed during the inequality phase.
Another option that has been floated is using an AST to evaluate the join condition but this is somewhat akin to performing a cartesian join ( though not necessarily one that has to be materialized) followed by a filter step which seems like its begging to be improved.
I think we will want to do a combination of both approaches. For inequality joins, we will want to do distribution based on a order as opposed to hashes
Im not sure combination is the right way to go here. I think we need to pick hopefully one strategy for the first implemenation. I am currently reviewing literature on the subject to see what else I can find.