HIVE-28363: Improve heuristics of FilterStatsRule without column stats
What changes were proposed in this pull request?
Make FilterStatsRule reduce the number of filtered rows by half when # of distinct values is empty. https://issues.apache.org/jira/browse/HIVE-28363
Why are the changes needed?
The current algorithm easily estimates the selectivity to be 100%. I believe it is not the best in most cases.
FilterStatsRule roughly estimates the number of rows filtered by IN to be {Original # of rows} * {1 / cardinality} * {# of values in IN}. The second term is estimated as 0.5 when column stats are unavailable. So, it always returns the original number when IN retains two or more constant values like col IN (1, 3).
Maybe, FilterStatsRule had behaved in the same way as this PR before, but this change slightly changed the formula to cover a special case. We will likely prefer the original formula when columns stats are not available.
Does this PR introduce any user-facing change?
No.
Is the change a dependency upgrade?
No.
How was this patch tested?
CREATE TABLE users (id INT);
INSERT INTO users VALUES (1), (2), (3), (4), (5), (6), (7), (8), (9), (10);
set hive.fetch.task.conversion=none;
set hive.stats.fetch.column.stats=false;
EXPLAIN SELECT * FROM users WHERE id IN (1);
EXPLAIN SELECT * FROM users WHERE id IN (1, 2);
Quality Gate passed
Issues
4 New issues
0 Accepted issues
Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code
@kgyrtkirk Can we merge this one or do I need something to fix? Thanks!
Quality Gate passed
Issues
4 New issues
0 Accepted issues
Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code
I rebased this branch on the current master yesterday
@kgyrtkirk I just merged the change. If you have further comments. we can optimize it later. :) Thanks.
Thank you, both!