duckdb
duckdb copied to clipboard
Reorder semi and anti joins.
follow up of https://github.com/duckdb/duckdb/pull/11573
To do this, we do the following,
- When we look at reordering and what relations can join with what other relations, we treat semi joins and anti joins like regular inner joins.
- To make sure the left and right children are not flipped, we compare the left and right bindings from the original join to the left and right bindings of the recreated plan. If the children are flipped, we flip them back.
- When calculating the cost of a semi join we pretend the left table (i.e the table that propagates the up) is being filtered on. We take the cardinality of the left side and multiply it by 0.2.
Since this cardinality estimation method uses multiplication, it is also symmetrical, which means we don't have to worry about different join plans for the same set of relations having different estimated cardinalities.
Calculating the denominator of the estimated cardinality is easier now. It works like finding a maximum spanning tree. Assuming relations are nodes and join filters are weighted edges, the process of finding the most selective filters is exactly like a maximum spanning tree problem. The weights of the edges come from the shared total domain of the columns of the filter.
Some other small improvements:
- More checks in the cardinality estimator to make sure the denominator is not 0 (since that will cause very inaccurate cardinality estimates)
- More checks during statistics extraction to make sure the distinct count of a column is always between 1 and the cardinality of the table.
Another heuristic was added for determining the number of distinct elements in a column as well. For integral type columns, if the maxVal - minVal is less than the distinct count measured by HLL, then DuckDB will prefer max - min as the distinct count.
I thought this was the source of a bad join order. Turns out that wasn't the case, but I think it is still a good join heuristic.
Looks much more readable now with the refactor, thanks!
Are the changes to the
ColumnLifetimeAnalyzeron purpose, or were you doing that on another branch?Also, we seem to have a regression for TPC-DS q95, which has two semi-joins. This wasn't the case in your previous PR. Could you check what's going on there? Maybe the semi-join cardinality estimation heuristic is too simple?
Noo, just bugs in my code, fixes it with the most recent commit
@Mytherin this ready to go now. You reviewed once already, but I had to battle CI for a while since I also needed to patch substrait
Thanks!