hive icon indicating copy to clipboard operation
hive copied to clipboard

HIVE-28363: Improve heuristics of FilterStatsRule without column stats

Open okumin opened this issue 1 year ago • 2 comments

What changes were proposed in this pull request?

Make FilterStatsRule reduce the number of filtered rows by half when # of distinct values is empty. https://issues.apache.org/jira/browse/HIVE-28363

Why are the changes needed?

The current algorithm easily estimates the selectivity to be 100%. I believe it is not the best in most cases. FilterStatsRule roughly estimates the number of rows filtered by IN to be {Original # of rows} * {1 / cardinality} * {# of values in IN}. The second term is estimated as 0.5 when column stats are unavailable. So, it always returns the original number when IN retains two or more constant values like col IN (1, 3).

Maybe, FilterStatsRule had behaved in the same way as this PR before, but this change slightly changed the formula to cover a special case. We will likely prefer the original formula when columns stats are not available.

Does this PR introduce any user-facing change?

No.

Is the change a dependency upgrade?

No.

How was this patch tested?

CREATE TABLE users (id INT);
INSERT INTO users VALUES (1), (2), (3), (4), (5), (6), (7), (8), (9), (10);
set hive.fetch.task.conversion=none;
set hive.stats.fetch.column.stats=false;
EXPLAIN SELECT * FROM users WHERE id IN (1);
EXPLAIN SELECT * FROM users WHERE id IN (1, 2);

okumin avatar Jul 08 '24 08:07 okumin

@kgyrtkirk Can we merge this one or do I need something to fix? Thanks!

okumin avatar Jul 24 '24 09:07 okumin

I rebased this branch on the current master yesterday

okumin avatar Sep 20 '24 09:09 okumin

@kgyrtkirk I just merged the change. If you have further comments. we can optimize it later. :) Thanks.

zhangbutao avatar Sep 28 '24 02:09 zhangbutao

Thank you, both!

okumin avatar Sep 29 '24 06:09 okumin