datafusion icon indicating copy to clipboard operation
datafusion copied to clipboard

[EPIC] Subquery support

Open andygrove opened this issue 3 years ago • 2 comments

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

We have many issues now for subquery support, with some duplication so I thought it would be good to create one epic to track all of this work.

There is also a design document.

Logical Plan

  • [x] https://github.com/apache/arrow-datafusion/issues/2247
  • [x] https://github.com/apache/arrow-datafusion/issues/2245
  • [x] https://github.com/apache/arrow-datafusion/issues/2181
  • [x] https://github.com/apache/arrow-datafusion/issues/2337
  • [x] https://github.com/apache/arrow-datafusion/pull/2342

SQL Query Planner

  • [x] https://github.com/apache/arrow-datafusion/issues/2238
  • [x] https://github.com/apache/arrow-datafusion/issues/2219
  • [x] https://github.com/apache/arrow-datafusion/issues/2237
  • [x] https://github.com/apache/arrow-datafusion/issues/2353

Logical Plan Optimizations

  • [ ] https://github.com/apache/arrow-datafusion/issues/2361
  • [ ] https://github.com/apache/arrow-datafusion/issues/2351
  • [x] https://github.com/apache/arrow-datafusion/issues/488

Physical Plan

  • [ ] https://github.com/apache/arrow-datafusion/issues/1835
  • [ ] https://github.com/apache/arrow-datafusion/issues/123

Ballista

  • [x] https://github.com/apache/arrow-datafusion/issues/2338

andygrove avatar Apr 16 '22 15:04 andygrove

Some additional ideas for subquery optimizations:

  • If rest of subquery is not correlated, push up correlated filter (in particular, correlated equality filter) into a left join on the filter.
  • More here: https://www.alibabacloud.com/blog/query-optimization-technology-for-correlated-subqueries_597644

More generally, one can attempt this approach, which is a lot less messy than relying on countless rewriting heuristics (which is also highly error-prone):

  • Unnesting arbitrary queries
  • Implementation in DuckDB here: https://github.com/duckdb/duckdb/blob/bee8017bdcc5e652aee26ce8cfb260990cf6a369/src/planner/subquery/flatten_dependent_join.cpp#L72

jon-chuang avatar May 01 '22 20:05 jon-chuang

  • [ ] https://github.com/apache/arrow-datafusion/issues/2480

jon-chuang avatar May 07 '22 20:05 jon-chuang

I filed https://github.com/apache/arrow-datafusion/issues/5483 to track the few items that are not done yet, and am going to close this ticket as we have subquery support in DataFusion now 🎉

alamb avatar Mar 05 '23 12:03 alamb