dolt
dolt copied to clipboard
Detect and deduplicate self-joins on unique keys
A table that joins itself on a unique key can be deduplicated:
select * from mytable a join mytable b on a.i = b.i;
=>
select i, s, i, s from mytable;
In the above query, (i)
is a unique key, and the output of the join will simply be the output of a single table.
This came up recently in a context where there was a self join on a non-unique key, but with a DISTINCT field that in practice collapses any join cardinality amplification:
select count(*) from (select distinct a.i from mytable a join mytable b on a.s = b.s);
=>
select count(*) from mytable;
In the above query, we join two tables on a non-unique key, (s)
. Because (s)
is non-unique, the output of a self-join on s
might be greater than the table on its own. The DISTINCT operator removes any of this duplication, letting us remove the self-join.
In the general case, DISTINCT only nullifies some varieties of join duplication:
select * from (select distinct a,i, count(*) as cnt from mytable a join mytable b on a.s = b.s where cnt > 3)
In the query above, we added a filter between the self-join and DISTINCT that depends on the self-join cardinality. Aggregations, windows, or scalar subqueries whose logic depends on self-join cardinality all have this issue (and potentially other operators).
I think it might be safe to apply this optimization:
- same table joined on a unique key
- same table joined on a non-unique key and a DISTINCT operator and no aggregation/window/scalar subquery expression between the join and DISTINCT
I may be overcomplicating this, but I think the transformation needs to happen between building the memo and creating the join reorder graph. This is because we need the following to do redundancy checks:
- list of the nodes in the graph (to partition by underlying table)
- equivalent column sets (to check that a table is self-joined)
- list of indexes supplemented with column ids (to check that a self-join columns complete an index column set)
Without these, we can't (1) partition the nodes by underlying table, (2) determine which tables have overlapping equivalency sets, and then (3) check whether the partitioned tables have any indexes whose columns are subsets of the equivalency sets. The only way to do this with the old transformation rule style is to basically build the memo/relProps separately from join reordering, or worse round-trip the memo twice. Collecting this info is expensive and I think it's probably easier and more desirable to just separate memo building from reorder graph building. The memo separated from reorder graph would let us perform safer and more powerful transformations.
This might take a few days, so I am going to defer for more direct client problems until this arises again.