Eliminate Self Joins
This PR implents two EliminateAggregationSelfJoin and EliminateUniqueKeyedSelfJoin optimization rules. As the name suggests these optimizations eliminate self joins when the expression returned can be rewritten more efficently with a window function or removed entirely respectively.
EliminateUniqueKeyedSelfJoin
In an inner join when joined column forms a unique index on the table a single row is returned.
High level overview of conditions that make this optimization possible is both sides of the join refer to same table, then joined columns when combined must form a unique index on the shared table. If any join filter is specified whether query produces a single row cannot be determined, as filter expression may return false.
Optimization Step
Sample optimizable query with employees table and id as the primary key.
SELECT a.id
FROM employees a
JOIN employees b ON a.id = b.id
WHERE b.department = 'HR';
Unoptimized logical plan.
Projection: a.id
Filter: b.department = Utf8("HR")
Inner Join: Filter: a.id = b.id
SubqueryAlias: a
TableScan: employees
SubqueryAlias: b
TableScan: employees
The query can be optimized by removing duplicate TableScan and SubqueryAlias nodes.
Projection: a.id
Filter: a.department = Utf8("HR")
SubqueryAlias: a
TableScan: employees
## EliminateAggregationSelfJoin
Following sample query with self-join can be used to gather analytics on product sales.
SELECT a.user_id, a.purchase_date, SUM(b.amount) AS running_total
FROM purchases a
JOIN purchases b ON a.user_id = b.user_id AND b.purchase_date <= a.purchase_date
GROUP BY a.user_id, a.purchase_date;
This query can more efficiently rewritten with a window expression.
SELECT
user_id,
purchase_date,
SUM(amount) OVER (PARTITION BY user_idORDER BY purchase_date ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total
FROM
purchases;
Without going into why this works, the conditions where this optimization can be applied is
- It should be inner join between same tables, a.k.a self-join
JOIN ON <join columns> AND <filter expression>- Joined columns should NOT form a unique index
- Filter expression should be binary expression with a comparison operator and should refer to the same column
- Columns in
JOIN ON ...should be the same asGROUP BY
These conditions conservatively allow self-join to be replaced with a window expression.
Closes: #14758
FYI @irenjj
Current implementation fails in call to assert_valid_optimization where schema of a optimization pass is compared against the previous state. Failure occurs when alias is replaced with an other. For example
SELECT
b.id
FROM
employees a
JOIN employees b USING (id)
WHERE
b.department = 'HR'
Since alias b won't exist after the optimization it is replaced with a. This causes TableReferences to not be the same and the invariant fails eventhough both of the alises refer to the same table.
I will try to take a look at this this weekend @atahanyorganci
@atahanyorganci Hello, would you still be interested in continuing with this?
@atahanyorganci Hello, would you still be interested in continuing with this?
I’ll drive this to completion tomorrow.
@berkaysynnada I'll be happy to review it
@alamb @xudong963 @jonathanc-n If you have some time, could you please review this PR (or tag others who might be interested)?
I worked closely with @atahanyorganci on this, and I believe these two rules provide a good foundation for eliminating self-joins. This area of optimization has a big potential for further improvements.
To avoid introducing regressions or bugs, I’ve taken a very conservative approach. Optimizations are only applied when specific patterns are clearly matched. I’ve also tried to include tests from multiple perspectives. Thanks in advance
Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.