[DF] DataFusion Missing Features & Bugs

Open andygrove opened this issue 3 years ago • 2 comments

The purpose of this issue is to track the work that we need to do in the DataFusion project to support moving the dask-sql planner to DataFusion.

High Priority Tech Debt

We need to fix some issues before we can really get started on the main features.

  • [x] https://github.com/apache/arrow-datafusion/issues/2164
  • [x] https://github.com/apache/arrow-datafusion/issues/2251
  • [x] https://github.com/apache/arrow-datafusion/issues/2247
  • [x] https://github.com/apache/arrow-datafusion/issues/2245

High Priority Features

The SQL query planner and logical plan need to implement these features. Note that DataFusion does not necessarily need to implement a physical plan for them, which reduces the scope of this work.

  • [x] SQL planner support for subqueries
    • [x] https://github.com/apache/arrow-datafusion/issues/2247
    • [x] https://github.com/apache/arrow-datafusion/issues/2245
    • [x] https://github.com/apache/arrow-datafusion/issues/2181
    • [x] https://github.com/apache/arrow-datafusion/issues/2337
    • [x] https://github.com/apache/arrow-datafusion/pull/2342
    • [x] https://github.com/apache/arrow-datafusion/issues/2238
    • [x] https://github.com/apache/arrow-datafusion/issues/2219
    • [x] https://github.com/apache/arrow-datafusion/issues/2237
    • [x] https://github.com/apache/arrow-datafusion/issues/2353
    • [x] https://github.com/apache/arrow-datafusion/issues/2417
  • [x] Advanced aggregate support
    • [x] https://github.com/apache/arrow-datafusion/issues/2378
    • [x] https://github.com/apache/arrow-datafusion/issues/2477
  • [x] Other
    • [x] https://github.com/apache/arrow-datafusion/issues/2380
    • [x] https://github.com/apache/arrow-datafusion/pull/2357

High Priority Bugs

These are the bugs that we are seeing when attempting to parse all the queries from our benchmark suite.

  • [ ] Type-coercion errors
    • [x] https://github.com/apache/arrow-datafusion/issues/2229
    • [x] https://github.com/apache/arrow-datafusion/issues/2420
  • [x] Schema errors
    • [x] https://github.com/apache/arrow-datafusion/issues/2372
  • [x] Subquery errors
    • [x] https://github.com/apache/arrow-datafusion/issues/2358
    • [x] https://github.com/apache/arrow-datafusion/issues/2379
    • [x] https://github.com/apache/arrow-datafusion/issues/2381
    • [x] https://github.com/apache/arrow-datafusion/issues/2415

Ongoing improvements

We do not need these for the benchmark suite, but we are likely to run into these features and bugs eventually, so it makes sense to work on them proactively.

  • [x] https://github.com/apache/arrow-datafusion/issues/2377
  • [x] https://github.com/apache/arrow-datafusion/issues/2496
  • [x] https://github.com/apache/arrow-datafusion/issues/2360
  • [x] https://github.com/apache/arrow-datafusion/issues/2573

Refactoring of DataFusion crates

We currently bring in the full datafusion crate as a dependency, including the physical plans and execution engine. We should depend only on the features necessary for SQL query planning and logical plan optimization. The following issues need to be resolved to achieve that.

  • [x] https://github.com/apache/arrow-datafusion/issues/2345
  • [x] https://github.com/apache/arrow-datafusion/issues/2614
  • [x] https://github.com/apache/arrow-datafusion/issues/2599
  • [x] https://github.com/apache/arrow-datafusion/issues/2535

Lower Priority Tech Debt / Misc Other Items

  • [ ] https://github.com/apache/arrow-datafusion/issues/2551 (to increase our confidence in the planning and optimization rules around LIMIT and OFFSET)
  • [x] https://github.com/apache/arrow-datafusion/issues/2212
  • [x] https://github.com/apache/arrow-datafusion/issues/2213

andygrove, Apr 14 '22 14:04

https://github.com/apache/arrow-datafusion/issues/2551 might also belong in the "other" category

alamb, May 24 '22 10:05

> apache/arrow-datafusion#2551 might also belong in the "other" category

Thanks @alamb. Strictly speaking, we're not blocked on this because we are only using the SQL query planner and the logical optimization rules. However, implementing the physical plan for OFFSET and adding some integration tests would give me greater confidence that the planning and optimization rules for LIMIT and OFFSET are correct, so I will add it to the list.

andygrove, May 25 '22 16:05