dask-sql icon indicating copy to clipboard operation
dask-sql copied to clipboard

[DF] Optimize out multiple TableScans for same Table

Open jdye64 opened this issue 3 years ago • 0 comments

Is your feature request related to a problem? Please describe. Certain queries exist that can cause Dask-SQL to produce a logical plan that reads the same table twice. This is un-necessary and rather just a copy of the already read table can be made.

pdf = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})

c.create_table("df1", pdf)
c.create_table("df2", pdf)

Ex: select * from df1 intersect select * from df2

Produces the LogicalPlan

Semi Join: #df1.a = #df2.a, #df1.b = #df2.b, #df1.c = #df2.c
  Projection: #df1.a, #df1.b, #df1.c    
  Aggregate: groupBy=[[#df1.a, #df1.b, #df1.c]], aggr=[[]]
      Projection: #df1.a, #df1.b, #df1.c        
          TableScan: df1 projection=Some([a, b, c])
      Projection: #df2.a, #df2.b, #df2.c
          TableScan: df2 projection=Some([a, b, c])'

Describe the solution you'd like An optimization rule should be written to detect this and only perform a single TableScan/Projection to get the data from the underlying storage location.

Describe alternatives you've considered None

Additional context None

jdye64 avatar Jun 22 '22 10:06 jdye64