HIVE-28744: fix INT_TYPE and BIGINT_TYPE range estimate values
What changes were proposed in this pull request?
Fix INT and BIGINT type column range statistics estimate values.
Why are the changes needed?
When hive.stats.estimate and hive.stats.fetch.column.stats are enabled and an INT or BIGINT column has no column statistics for its range, incorrect ranges are used in the query plan, which produces wrong row-count estimates.
For example, if a query contains {BIGINT Type col} > Integer.MAX_VALUE, the estimated numRows is always 0.
Repro steps:
- No stats for the BIGINT type column.
- Write a query using {BIGINT Type col} > Integer.MAX_VALUE, and EXPLAIN it.
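The failure mode described above can be illustrated in plain Java (a sketch of the suspected mechanism, not the actual Hive estimator code; the variable names are hypothetical):

```java
public class ZeroRowEstimateDemo {
    public static void main(String[] args) {
        // Suppose the estimator wrongly defaults a BIGINT column's maximum
        // to Integer.MAX_VALUE instead of Long.MAX_VALUE.
        long estimatedMax = Integer.MAX_VALUE;
        // The filter constant from the query: {BIGINT col} > Integer.MAX_VALUE.
        long filterConstant = Integer.MAX_VALUE;
        // No value in the estimated range can exceed the constant,
        // so the planner estimates numRows = 0 even if real rows would match.
        boolean anyRowCanMatch = estimatedMax > filterConstant;
        System.out.println(anyRowCanMatch); // false
    }
}
```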
Does this PR introduce any user-facing change?
No
Is the change a dependency upgrade?
No
How was this patch tested?
Quality Gate passed
Issues
4 New issues
0 Accepted issues
Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code
@barking-code Thanks for locating the issue and providing the fix. Could you update the output of the failing test cases? Please feel free to ask me if you have any questions. https://ci.hive.apache.org/blue/organizations/jenkins/hive-precommit/detail/PR-5637/1/pipeline/580/
@okumin Hello, okumin. I can't access this page; it shows a 404 error, but I think it is an authentication or authorization issue. Do you know how I can access it? And do I need to add tests to pass the CI?
@barking-code, first log in to the CI (top right-hand corner) before directly opening the URL mentioned in the above comments. From the CI, you can check which tests are failing and update/fix them accordingly.
If you want to overwrite the q.out files then you can use a command like this:
mvn test -Dtest=<DriverName> -Dqfile=<test_name>.q -Dtest.output.overwrite
For example:
mvn test -Dtest=TestMiniLlapLocalCliDriver -Dqfile=msck_repair_8.q -Drat.skip -Dtest.output.overwrite
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Feel free to reach out on the [email protected] list if the patch is in need of reviews.
@barking-code Do you have a chance to update this branch? You can also take a look at a new document on how to run or update test cases. https://hive.apache.org/development/qtest/
Changes to the INT_TYPE range cause execution problems: the reducer waits for input rows, but nothing is being sent.
I am taking it over since there have been no updates on the PR; the fix is incomplete and breaks the code.
@kasakrisz, @okumin please take a look
@barking-code, @deniskuzZ
set hive.support.concurrency=true;
set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
create table t1 (a bigint) stored as orc TBLPROPERTIES ('transactional'='true');
insert into t1 values
(1),
(1704034800001),
(1704034800001),
(1704034800001),
(1704034800001),
(1704034800001),
(1704034800001),
(1704034800001),
(1704034800001);
delete from t1 where a = 1; -- makes the column stats stale
set hive.fetch.task.conversion=none;
explain
select * from t1 where a >= 1704034800000;
explain
select * from t1 where a >= 10;
Partial plan of the first query:
select * from t1 where a >= 1704034800000
TableScan
alias: t1
filterExpr: (a >= 1704034800000L) (type: boolean)
Statistics: Num rows: 8 Data size: 64 Basic stats: COMPLETE Column stats: COMPLETE
Filter Operator
predicate: (a >= 1704034800000L) (type: boolean)
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
Based on stats estimations, the TS reads 8 rows and the filter forwards only 1 row. I ran the query using both Apache master and this patch, but got the same plan and row-count estimates.
Partial plan of the second query:
select * from t1 where a >= 10
TableScan
alias: t1
filterExpr: (a >= 10L) (type: boolean)
Statistics: Num rows: 8 Data size: 64 Basic stats: COMPLETE Column stats: COMPLETE
Filter Operator
predicate: (a >= 10L) (type: boolean)
Statistics: Num rows: 8 Data size: 64 Basic stats: COMPLETE Column stats: COMPLETE
This time the filter forwards 8 rows, both on Apache master and with this patch.
Please provide the repro steps to move this forward. Thank you.
@kasakrisz, please see https://github.com/apache/hive/blob/master/ql/src/test/queries/clientpositive/explainanalyze_5.q. Due to type overflow (casting long to int), we estimate the wrong number of rows, i.e. 1.
I originally added the following test case, but explainanalyze_5.q replicates the issue as well: https://github.com/apache/hive/pull/5637/files#diff-b48e5b1dc71b6ee41016e5bd56ba9e725d3e52f362aeb82de939e41de30cc753R477
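For context, the overflow mentioned above can be reproduced in plain Java: a narrowing cast from long to int keeps only the low 32 bits, so a BIGINT bound like the one in the test data below becomes a negative int (a minimal sketch; the printed value comes from Java's narrowing conversion, not from Hive code):

```java
public class RangeOverflowDemo {
    public static void main(String[] args) {
        long bigintValue = 1704034800001L;  // BIGINT value from the repro data
        // Narrowing conversion keeps only the low 32 bits; the result is negative.
        int truncated = (int) bigintValue;
        System.out.println(truncated); // -1067216511
        // A range estimator that narrows long bounds to int this way would
        // compare the filter constant against a garbage maximum.
    }
}
```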
set hive.stats.column.autogather=false;
set hive.stats.estimate=true;
set hive.stats.fetch.column.stats=true;
set hive.support.concurrency=true;
set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
create table dot_n0(cint INT, cbigint BIGINT) stored as orc TBLPROPERTIES ('transactional'='true');
insert into table dot_n0 values
(1,1), (2,2), (3,3), (4,4), (5,5), (-1070551680, 1704034800001), (-1070551681, 1704034800002);
explain analyze delete from dot_n0 where cint < -1070551679;
explain analyze delete from dot_n0 where cbigint >= 1704034800000;
Do we have test coverage for all the affected datatypes (int, long, bigint, decimal, date, timestamp) ?
added additional tests
@okumin, @kasakrisz, code freeze is today. Could we please move forward with the fix?
Do we have test coverage for all the affected datatypes (int, long, bigint, decimal, date, timestamp) ?
I haven't changed date & timestamp; only int, bigint (represented as long), and decimal (see the TPC-DS tests for more accurate row count estimation). Existing q tests cover the intent of this PR.
The reason I ask is that, for example, timestamp had
long maxValue = cs.getRange().maxValue.longValue();
long minValue = cs.getRange().minValue.longValue();
So the minValue was also long. In this patch it is int.
fixed
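For illustration, the idea behind the fix can be sketched like this (a hypothetical helper, not the actual Hive code: when column stats lack a range, fall back to bounds matching the column type's width instead of a single hard-coded int range):

```java
public class DefaultRangeSketch {
    // Hypothetical fallback: choose default range bounds per column type
    // so that BIGINT predicates above Integer.MAX_VALUE stay satisfiable.
    static long[] defaultRange(String colType) {
        switch (colType) {
            case "int":    return new long[] { Integer.MIN_VALUE, Integer.MAX_VALUE };
            case "bigint": return new long[] { Long.MIN_VALUE, Long.MAX_VALUE };
            default: throw new IllegalArgumentException("unsupported type: " + colType);
        }
    }

    public static void main(String[] args) {
        // A BIGINT filter constant above Integer.MAX_VALUE now falls inside the range.
        long[] range = defaultRange("bigint");
        System.out.println(1704034800000L <= range[1]); // true
    }
}
```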
Not chasing this in the 4.1 release anymore.
Quality Gate passed
Issues
0 New issues
0 Accepted issues
Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code
@kasakrisz, have I covered all your concerns? Let me know if there are any additional tests you'd like to include: https://github.com/apache/hive/blob/ecd3dfab281e647fceed18db1ed5ebe748822908/ql/src/test/queries/clientpositive/explainanalyze_6.q
FYI, the latest commit also updated the date and timestamp ranges to future values: https://github.com/apache/hive/pull/5637/files#diff-1b0e96106a06cddf1ca7e77f3b4cf45d849d86db4766fa7a8e9e945567f41cf3R147