dask-sql
dask-sql copied to clipboard
Correlated subqueries
Previously, we would get a ValueError: Not all divisions are known, can't align partitions. Please use set_index to set the index.
for something like:
from dask_sql import Context
import dask.dataframe as dd
import pandas as pd
c = Context()
names = ["Miracle", "Sunshine", "Pretty woman", "Handsome man", "Barbie", "Cool painting", "Black square #1000", "Mountains"]
prices = [300, 700, 2800, 2300, 250, 5000, 50, 1300]
ids = [11, 12, 13, 14, 15, 16, 17, 18]
artist_id = [1, 1, 2, 2, 3, 3, 3, 4]
paintings = dd.from_pandas(pd.DataFrame({"id": ids, "name": names, "artist_id": artist_id, "listed_price": prices}), npartitions=1)
c.create_table("paintings", paintings)
sql1 = """
SELECT name, listed_price
FROM paintings
WHERE listed_price > (
SELECT AVG(listed_price)
FROM paintings
)
"""
c.sql(sql1).compute()
Not sure if this is the way we should go about this (not generalizable enough?), but here is an initial quick fix for that example. The general idea is that since we are comparing listed_price
to a 1x1 table containing AVG(listed_price)
, the latter has to be converted to a single value by calling compute()
and with casting.
Fixes example in #320
df = pd.DataFrame({'id': [0, 1, 2], 'name': ['a', 'b', 'c'], 'val': [0, 1, 2]})
c.create_table('test', df)
c.sql("""
select name, val, id from test a
where val >
(select avg(val) from test)
""").compute()
Codecov Report
:exclamation: No coverage uploaded for pull request base (
datafusion-sql-planner@c8259b9
). Click here to learn what that means. The diff coverage isn/a
.
@@ Coverage Diff @@
## datafusion-sql-planner #683 +/- ##
=========================================================
Coverage ? 66.95%
=========================================================
Files ? 73
Lines ? 3640
Branches ? 753
=========================================================
Hits ? 2437
Misses ? 1057
Partials ? 146
:mega: We’re building smart automated test selection to slash your CI/CD build times. Learn more
cc @andygrove in case you have some thoughts on this
I filed an issue against DataFusion to add support for this type of query: https://github.com/apache/arrow-datafusion/issues/3266
Looks like this was resolved on the DataFusion side with https://github.com/apache/arrow-datafusion/pull/3287 !