dask-sql icon indicating copy to clipboard operation
dask-sql copied to clipboard

Correlated subqueries

Open sarahyurick opened this issue 2 years ago • 2 comments

Previously, we would get a ValueError: Not all divisions are known, can't align partitions. Please use set_index to set the index. for something like:

from dask_sql import Context
import dask.dataframe as dd
import pandas as pd

c = Context()

names = ["Miracle", "Sunshine", "Pretty woman", "Handsome man", "Barbie", "Cool painting", "Black square #1000", "Mountains"]
prices = [300, 700, 2800, 2300, 250, 5000, 50, 1300]
ids = [11, 12, 13, 14, 15, 16, 17, 18]
artist_id = [1, 1, 2, 2, 3, 3, 3, 4]
paintings = dd.from_pandas(pd.DataFrame({"id": ids, "name": names, "artist_id": artist_id, "listed_price": prices}), npartitions=1)
c.create_table("paintings", paintings)

sql1 = """
SELECT name, listed_price
FROM paintings
WHERE listed_price > (
    SELECT AVG(listed_price)
    FROM paintings
)
"""
c.sql(sql1).compute()

Not sure if this is the way we should go about this (not generalizable enough?), but here is an initial quick fix for that example. The general idea is that since we are comparing listed_price to a 1x1 table containing AVG(listed_price), the latter has to be converted to a single value by calling compute() and with casting.

sarahyurick avatar Aug 11 '22 23:08 sarahyurick

Fixes example in #320

df = pd.DataFrame({'id': [0, 1, 2], 'name': ['a', 'b', 'c'], 'val': [0, 1, 2]})

c.create_table('test', df)
c.sql("""
select name, val, id from test a
where val >
  (select avg(val) from test)
""").compute()

sarahyurick avatar Aug 11 '22 23:08 sarahyurick

Codecov Report

:exclamation: No coverage uploaded for pull request base (datafusion-sql-planner@c8259b9). Click here to learn what that means. The diff coverage is n/a.

@@                    Coverage Diff                    @@
##             datafusion-sql-planner     #683   +/-   ##
=========================================================
  Coverage                          ?   66.95%           
=========================================================
  Files                             ?       73           
  Lines                             ?     3640           
  Branches                          ?      753           
=========================================================
  Hits                              ?     2437           
  Misses                            ?     1057           
  Partials                          ?      146           

:mega: We’re building smart automated test selection to slash your CI/CD build times. Learn more

codecov-commenter avatar Aug 11 '22 23:08 codecov-commenter

cc @andygrove in case you have some thoughts on this

I filed an issue against DataFusion to add support for this type of query: https://github.com/apache/arrow-datafusion/issues/3266

andygrove avatar Aug 25 '22 16:08 andygrove

Looks like this was resolved on the DataFusion side with https://github.com/apache/arrow-datafusion/pull/3287 !

sarahyurick avatar Sep 08 '22 05:09 sarahyurick