datafusion icon indicating copy to clipboard operation
datafusion copied to clipboard

[DISCUSSION] JOIN "task force" / project team

Open alamb opened this issue 7 months ago • 12 comments

What I see (what problem we are trying to solve)

DataFusion's current join implementations are fairly basic. They are functional enough to run TPCH and TPC-DS, but lack other features such as larger-than-memory processing, ASOF joins, complete subquery support and more.

There seems to be a non trivial desire in the community to improve this.

Some examples of issues / tickets related to enhanced join support / features:

Subqueries (which are implemented as joins)

  • [ ] https://github.com/apache/datafusion/issues/5483
  • [ ] https://github.com/apache/datafusion/issues/5492
  • [ ] https://github.com/apache/datafusion/issues/14554

Join Features

  • [ ] https://github.com/apache/datafusion/issues/12454
  • [ ] https://github.com/apache/datafusion/issues/15784
  • [ ] https://github.com/apache/datafusion/issues/14239
  • [ ] https://github.com/apache/datafusion/issues/14238
  • [ ] https://github.com/apache/datafusion/issues/13765
  • [ ] https://github.com/apache/datafusion/issues/13003
  • [ ] https://github.com/apache/datafusion/issues/12952
  • [ ] https://github.com/apache/datafusion/issues/10048
  • [ ] https://github.com/apache/datafusion/issues/3843

Specialized Joins

  • [x] https://github.com/apache/datafusion/issues/9846
  • [ ] https://github.com/apache/datafusion/issues/318
  • [x] https://github.com/apache/datafusion/issues/13471
  • [ ] https://github.com/apache/datafusion/issues/13232
  • [ ] https://github.com/apache/datafusion/issues/13181
  • [ ] https://github.com/apache/datafusion/issues/13138

Performance

  • [ ] https://github.com/apache/datafusion/issues/15382
  • [ ] https://github.com/apache/datafusion/issues/7955
  • [ ] https://github.com/apache/datafusion/issues/14758
  • [ ] https://github.com/apache/datafusion/issues/13620

What is blocking significant forward progress

In my mind, the major challenge is that "improving" JOINs can get arbitrarily complicated. There are dozens of academic paper each year on various aspects of join implemnetations, and designing / implementing join capabilities is a substantial engineering effort.

I spent 6 years of my life doing joins at Vertica where they accounted for around 50% of the optimizer's complexity, to give some sense

I don't think the issue is that any particular feature is super complicated to understand, but defining the overall goal, the framework that will accomodate the goal, and then breaking it down into implementable pieces itself I think will require both specialized knowledge and substantial time.

What I suggest

I suggest that people with the relevant skills and time to invest gather together to drive this process worward

  1. plan out a "join roadmap" (aka prioritize what join features they will push forward)
  2. Figure out what, if any, new structures are in place
  3. Start breaking it down into smaller tickets I can't personally lead such an effort, but I am filing this ticket to try and help connect the relevant people in the community that can.

Some potential people that could help (sorry if I didn't list you)

  • @duongcongtoai -- the discussion on https://github.com/apache/datafusion/issues/14554#issuecomment-2798943345
  • @xudong963 who has experience in this area
  • @Dandandan @comphead and @korowa who contributed substantially to the existing joins
  • @mingmwang and @jackwener who contributed significantly to the original subquery implementation
  • @liukun4515 who likewise helped significantly
  • @suibianwanwank

Related content:

Related blogs (join ordering section in part 2): https://www.influxdata.com/blog/optimizing-sql-dataframes-part-two/

alamb avatar Apr 28 '25 21:04 alamb

not sure if it will help direction, cost nothing to share :) Debunking the Myth of Join Ordering: Toward Robust SQL Analytics

milenkovicm avatar Apr 29 '25 13:04 milenkovicm

not sure if it will help direction, cost nothing to share :) Debunking the Myth of Join Ordering: Toward Robust SQL Analytics

I have that paper on my reading list. Does anyone know of a production system that has implemented the RPT framework?

alamb avatar Apr 29 '25 21:04 alamb

not sure if it will help direction, cost nothing to share :) Debunking the Myth of Join Ordering: Toward Robust SQL Analytics

I have that paper on my reading list. Does anyone know of a production system that has implemented the RPT framework?

DuckDB is integrating it https://github.com/duckdb/duckdb/pull/17326

2010YOUY01 avatar May 02 '25 15:05 2010YOUY01

not sure if it will help direction, cost nothing to share :) Debunking the Myth of Join Ordering: Toward Robust SQL Analytics

I have that paper on my reading list. Does anyone know of a production system that has implemented the RPT framework?

DuckDB is integrating it duckdb/duckdb#17326

Cool, looking forward to seeing the final result

xudong963 avatar May 04 '25 15:05 xudong963

In case others haven't heard, @irenjj is working on additional subquery support as part of a Google Summer of Code Project (where @jayzhan211 and I are helping mentor).

I am not quite sure what our next steps will be here

alamb avatar May 15 '25 12:05 alamb

Update:

  • @irenjj is tracking his project here: https://github.com/apache/datafusion/issues/16059

I think the first thing we will be focusing on is a more general framework for decorrelation

  • https://github.com/apache/datafusion/issues/5492

alamb avatar May 18 '25 13:05 alamb

I have seen @irenjj / @duongcongtoai / @UBarney / @jonathanc-n working on this

@xudong963 (a DataFusion committer and PMC member) has experience in this area and may have additional time to help out, coordinate and review PRs.

alamb avatar Jun 13 '25 16:06 alamb

FYI, I just followed the latest paper from TUM, "Improving Unnesting of Complex Queries", and will learn about the current code in DF and read the latest PRs in DF about unnesting subquery

xudong963 avatar Jun 15 '25 11:06 xudong963

FYI, I just followed the latest paper from TUM, "Improving Unnesting of Complex Queries", and will learn about the current code in DF and read the latest PRs in DF about unnesting subquery

Since I think DataBend is open source https://github.com/databendlabs/databend perhaps you can point us at the relevant code in that codebase that would be a good source of study?

alamb avatar Jun 16 '25 11:06 alamb

Since I think DataBend is open source https://github.com/databendlabs/databend perhaps you can point us at the relevant code in that codebase that would be a good source of study?

Sure, the dir contains all subquery unnesting code: https://github.com/databendlabs/databend/tree/main/src/query/sql/src/planner/optimizer/optimizers/operator/decorrelate

And this is the entry point of SubqueryDecorrelatorOptimizer rule: https://github.com/databendlabs/databend/blob/main/src/query/sql/src/planner/optimizer/optimizers/operator/decorrelate/subquery_decorrelator.rs#L802

xudong963 avatar Jun 16 '25 11:06 xudong963

Thanks @xudong963 and @alamb , we are trying to follow DuckDB's implementation: transform dependent join, flatten dependent join, eliminate delimjoin. I will spend time this weekend looking at Databend's implementation, maybe this will bring new ideas!

irenjj avatar Jun 17 '25 11:06 irenjj

DuckDB's implementation: transform dependent join, flatten dependent join, eliminate delimjoin

I believe we're almost same, expect the implementation of datebend doesn't define dependent join explicitly

xudong963 avatar Jun 18 '25 12:06 xudong963