noisepage icon indicating copy to clipboard operation
noisepage copied to clipboard

Derived column statistics for query optimization

Open jkosh44 opened this issue 3 years ago • 0 comments

Feature Request

Summary

#1504 Adds the ability for the optimizer to use column and table statistics in order to compute query selectivity. However, the optimizer only uses statistics on physical tables in the database, we aren't able to derive statistics on intermediate query outputs. For example, if we have a join then we use column statistics to determine the selectivity of each underlying get, but we only use the number of rows from each get to approximate the number of rows of the join. The problem gets worse when we have more joins or deeper operator trees. Only Logical Gets use column statistics to compute selectivity, all other operators use the number of rows of its children to approximate selectivity.

Previously operators would derive statistics by copying their children's statistics and updating the number of rows in the statistics object, no other field would get updated. These "derived" statistics were not used anywhere and were not particularly accurate, so they were removed in #1504.

Being able to accurately derive column statistics from operator outputs would increase our accuracy in selectivity computations and help the cost model make more accurate decisions.

This issue may be captured in #436

jkosh44 avatar Mar 24 '21 00:03 jkosh44