dask-sql
Distributed SQL Engine in Python using Dask
This PR changes the implementation of `DaskFunctions` to support overloaded UDF definitions:
- the `return_type` attribute has been replaced with `return_types`, a `HashMap`, mapping the potential input types of a...
Closes #608
Blocked by: https://github.com/rapidsai/cudf/issues/11515
Note: currently, performing multiple aggregations at once seems to result in incorrect values. Ex: `SELECT STDDEV(a) AS s1, STDDEV_POP(a) AS s2 FROM df` returns the...
#629 implemented STDDEV_POP on CPU, but it currently fails on GPU due to: https://github.com/rapidsai/cudf/issues/11515#issuecomment-1212305118
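For reference, the two aggregations mentioned above differ only in their divisor. The following is a minimal plain-Python sketch, not dask-sql or cuDF code, and it assumes `STDDEV` denotes the sample standard deviation, as it does in many SQL dialects. The helper names `stddev_samp` and `stddev_pop` and the sample column `a` are illustrative only.

```python
import math


def stddev_samp(values):
    """Sample standard deviation: sum of squared deviations divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))


def stddev_pop(values):
    """Population standard deviation: sum of squared deviations divided by n."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((v - mean) ** 2 for v in values) / n)


# Hypothetical column values, chosen only for illustration.
a = [1.0, 2.0, 3.0, 4.0]
s1 = stddev_samp(a)  # assumed counterpart of STDDEV(a)
s2 = stddev_pop(a)   # counterpart of STDDEV_POP(a)
print(s1, s2)
```

For any column with at least two distinct values, the sample value is strictly larger than the population value, so a query that returns the same number for both aggregations is a sign that something is off.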
Previously, we would get a `ValueError: Not all divisions are known, can't align partitions. Please use set_index to set the index.` for something like:
```
from dask_sql import Context
import...
```
I'm struggling to find a programmatic reproducer for this, but on the datafusion-sql-planner branch:
```
c.sql("SELECT * FROM large_table limit 5")
```
results in reading the entire dataset before filtering...
Repro:
```
import pandas as pd
from dask_sql import Context

c = Context()
df = pd.DataFrame({"id": [0, 1, 1, 2], "val": [1, 1, 2, 1]})
c.create_table("df", df)
c.sql("""
SELECT val,...
```
Add optimizer rules to translate subqueries to joins (when possible)
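To illustrate the kind of rewrite such a rule performs, here is a small sketch that uses Python's built-in `sqlite3` module rather than dask-sql. It does not show the optimizer rule itself, only that an uncorrelated `IN` subquery and a join against the de-duplicated subquery result return the same rows. The `orders` and `customers` tables and their contents are hypothetical.

```python
import sqlite3

# In-memory database with two hypothetical tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER, customer_id INTEGER);
CREATE TABLE customers (id INTEGER, active INTEGER);
INSERT INTO orders VALUES (1, 10), (2, 20), (3, 10), (4, 30);
INSERT INTO customers VALUES (10, 1), (20, 0), (30, 1);
""")

# Original form: filter with an uncorrelated IN subquery.
subquery_sql = """
SELECT o.id FROM orders o
WHERE o.customer_id IN (SELECT c.id FROM customers c WHERE c.active = 1)
ORDER BY o.id
"""

# Rewritten form: join against the de-duplicated subquery result.
# DISTINCT keeps the join from multiplying rows if ids repeat.
join_sql = """
SELECT o.id FROM orders o
JOIN (SELECT DISTINCT id FROM customers WHERE active = 1) c
  ON o.customer_id = c.id
ORDER BY o.id
"""

subquery_rows = conn.execute(subquery_sql).fetchall()
join_rows = conn.execute(join_sql).fetchall()
print(subquery_rows == join_rows)
```

The "when possible" qualifier matters: the rewrite is only safe when the join cannot change row multiplicity, which is why the sketch de-duplicates the subquery side before joining.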