[EPIC] Subquery support
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
We have many issues now for subquery support, with some duplication so I thought it would be good to create one epic to track all of this work.
There is also a design document.
Logical Plan
- [x] https://github.com/apache/arrow-datafusion/issues/2247
- [x] https://github.com/apache/arrow-datafusion/issues/2245
- [x] https://github.com/apache/arrow-datafusion/issues/2181
- [x] https://github.com/apache/arrow-datafusion/issues/2337
- [x] https://github.com/apache/arrow-datafusion/pull/2342
SQL Query Planner
- [x] https://github.com/apache/arrow-datafusion/issues/2238
- [x] https://github.com/apache/arrow-datafusion/issues/2219
- [x] https://github.com/apache/arrow-datafusion/issues/2237
- [x] https://github.com/apache/arrow-datafusion/issues/2353
Logical Plan Optimizations
- [ ] https://github.com/apache/arrow-datafusion/issues/2361
- [ ] https://github.com/apache/arrow-datafusion/issues/2351
- [x] https://github.com/apache/arrow-datafusion/issues/488
Physical Plan
- [ ] https://github.com/apache/arrow-datafusion/issues/1835
- [ ] https://github.com/apache/arrow-datafusion/issues/123
Ballista
- [x] https://github.com/apache/arrow-datafusion/issues/2338
Some additional ideas for subquery optimizations:
- If rest of subquery is not correlated, push up correlated filter (in particular, correlated equality filter) into a left join on the filter.
- More here: https://www.alibabacloud.com/blog/query-optimization-technology-for-correlated-subqueries_597644
More generally, one can attempt this approach, which is a lot less messy than relying on countless rewriting heuristics (which is also highly error-prone):
- Unnesting arbitrary queries
- Implementation in DuckDB here: https://github.com/duckdb/duckdb/blob/bee8017bdcc5e652aee26ce8cfb260990cf6a369/src/planner/subquery/flatten_dependent_join.cpp#L72
- [ ] https://github.com/apache/arrow-datafusion/issues/2480
I filed https://github.com/apache/arrow-datafusion/issues/5483 to track the few items that are not done yet, and am going to close this ticket as we have subquery support in DataFusion now 🎉