dask-sql icon indicating copy to clipboard operation
dask-sql copied to clipboard

[ENH] Expand inner/left/right join logic to properly handle joins on non-referenced columns

Open charlesbluca opened this issue 2 years ago • 0 comments

Is your feature request related to a problem? Please describe. Currently, when faced with an inner/left/right join condition that does not compare directly referenced columns (for example, if one of the columns is the result of a cast operation), we opt to instead to do an cross join and then filter the result by the computed join condiiton(s).

Given that cross joins raise a warning around resource intensiveness, I'm wondering if this could create a performance bottleneck for otherwise simple queries:

import pandas as pd
from dask_sql import Context

c = Context()
c.create_table("df1", pd.DataFrame({"a": [1, 2, 3, 4, 5]}))
c.create_table("df2", pd.DataFrame({"b": [1.0, 2.0, 3.0, 4.0, 5.0]}))

df = c.sql("SELECT * FROM df1 JOIN df2 ON a=b")

Describe the solution you'd like I'd be interested in refactoring our join code so that _extract_lhs_rhs has handling for operands beyond just RexType.References, and reflecting that change across the remainder of the join code so that we pre-compute RexType.Call columns prior to the join operation and use them in an inner/left/right join rather than as filters for a cross join.

Additional context Was thinking about this while writing out a test for https://github.com/dask-contrib/dask-sql/pull/1153#discussion_r1207032009; this was initially a bit confusing because the EXPLAIN for the above query implies that an inner join would be executed, when in actuality we run through the cross join codepath on the Python side.

charlesbluca avatar May 26 '23 16:05 charlesbluca