datafusion icon indicating copy to clipboard operation
datafusion copied to clipboard

Eliminate Self Joins

Open atahanyorganci opened this issue 8 months ago • 3 comments

This PR implents two EliminateAggregationSelfJoin and EliminateUniqueKeyedSelfJoin optimization rules. As the name suggests these optimizations eliminate self joins when the expression returned can be rewritten more efficently with a window function or removed entirely respectively.

EliminateUniqueKeyedSelfJoin

In an inner join when joined column forms a unique index on the table a single row is returned.

High level overview of conditions that make this optimization possible is both sides of the join refer to same table, then joined columns when combined must form a unique index on the shared table. If any join filter is specified whether query produces a single row cannot be determined, as filter expression may return false.

Optimization Step

Sample optimizable query with employees table and id as the primary key.

SELECT a.id
FROM employees a
JOIN employees b ON a.id = b.id
WHERE b.department = 'HR';

Unoptimized logical plan.

Projection: a.id
  Filter: b.department = Utf8("HR")
    Inner Join:  Filter: a.id = b.id
      SubqueryAlias: a
        TableScan: employees
      SubqueryAlias: b
        TableScan: employees

The query can be optimized by removing duplicate TableScan and SubqueryAlias nodes.

Projection: a.id
  Filter: a.department = Utf8("HR")
    SubqueryAlias: a
      TableScan: employees

## EliminateAggregationSelfJoin

Following sample query with self-join can be used to gather analytics on product sales.

SELECT a.user_id, a.purchase_date, SUM(b.amount) AS running_total
FROM purchases a
JOIN purchases b ON a.user_id = b.user_id AND b.purchase_date <= a.purchase_date
GROUP BY a.user_id, a.purchase_date;

This query can more efficiently rewritten with a window expression.

SELECT
  user_id,
  purchase_date,
  SUM(amount) OVER (PARTITION BY user_idORDER BY purchase_date ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total
FROM
  purchases;

Without going into why this works, the conditions where this optimization can be applied is

  • It should be inner join between same tables, a.k.a self-join
  • JOIN ON <join columns> AND <filter expression>
    • Joined columns should NOT form a unique index
    • Filter expression should be binary expression with a comparison operator and should refer to the same column
  • Columns in JOIN ON ... should be the same as GROUP BY

These conditions conservatively allow self-join to be replaced with a window expression.

Closes: #14758

atahanyorganci avatar May 11 '25 22:05 atahanyorganci

FYI @irenjj

alamb avatar May 12 '25 17:05 alamb

Current implementation fails in call to assert_valid_optimization where schema of a optimization pass is compared against the previous state. Failure occurs when alias is replaced with an other. For example

SELECT
    b.id
FROM
    employees a
    JOIN employees b USING (id)
WHERE
    b.department = 'HR'

Since alias b won't exist after the optimization it is replaced with a. This causes TableReferences to not be the same and the invariant fails eventhough both of the alises refer to the same table.

atahanyorganci avatar May 28 '25 14:05 atahanyorganci

I will try to take a look at this this weekend @atahanyorganci

jonathanc-n avatar Jun 14 '25 17:06 jonathanc-n

@atahanyorganci Hello, would you still be interested in continuing with this?

jonathanc-n avatar Jun 27 '25 05:06 jonathanc-n

@atahanyorganci Hello, would you still be interested in continuing with this?

I’ll drive this to completion tomorrow.

berkaysynnada avatar Jul 17 '25 16:07 berkaysynnada

@berkaysynnada I'll be happy to review it

jonathanc-n avatar Jul 17 '25 16:07 jonathanc-n

@alamb @xudong963 @jonathanc-n If you have some time, could you please review this PR (or tag others who might be interested)?

I worked closely with @atahanyorganci on this, and I believe these two rules provide a good foundation for eliminating self-joins. This area of optimization has a big potential for further improvements.

To avoid introducing regressions or bugs, I’ve taken a very conservative approach. Optimizations are only applied when specific patterns are clearly matched. I’ve also tried to include tests from multiple perspectives. Thanks in advance

berkaysynnada avatar Jul 30 '25 10:07 berkaysynnada

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

github-actions[bot] avatar Oct 08 '25 02:10 github-actions[bot]