dask-sql
dask-sql copied to clipboard
[DF] Add DISTRIBUTE BY to DataFusion Port
Is your feature request related to a problem? Please describe.
Previously with Calcite we had created some custom syntax logic to allow for the DISTRIBUTE BY
clause. We need to recreate that logic so queries using DataFusion can also use DISTRIBUTE BY
Describe the solution you'd like
DISTRIBUTE BY
clause working in all SQL queries.
Here is Spark's documentation for this: https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-distribute-by.html
We need to figure out where to add this. DataFusion can already parse this SQL but ignores it during planning. We probably need to represent this with an extra operator in the plan:
- Distribute
- Projection
DataFusion PR: https://github.com/apache/arrow-datafusion/pull/3208