
HIVE-28744: fix INT_TYPE and BIGINT_TYPE range estimate values

barking-code opened this issue 1 year ago · 6 comments

What changes were proposed in this pull request?

Fix INT and BIGINT type column range statistics estimate values.

Why are the changes needed?

When hive.stats.estimate and hive.stats.fetch.column.stats are enabled and an INT or BIGINT column has no column statistics with range information, incorrect default ranges are used in the query plan, which produces wrong row-count estimates.

e.g. if a query contains {BIGINT type col} > Integer.MAX_VALUE, the estimated numRows is always 0.

  1. No stats for the BIGINT type column. (screenshot)

  2. Write a query using {BIGINT type col} > Integer.MAX_VALUE, and EXPLAIN it. (screenshot)
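The zero-row estimate can be reproduced outside Hive with a simplified model. The following is a hedged sketch, not Hive's actual estimator code: it only illustrates how a default range clamped to Integer.MAX_VALUE makes any `> Integer.MAX_VALUE` predicate select nothing under a uniform-distribution assumption.

```python
# Simplified, hypothetical model of range-based selectivity estimation.
# Assumption: values are uniformly distributed over [assumed_min, assumed_max].

INT_MAX = 2**31 - 1  # Java's Integer.MAX_VALUE

def estimate_greater_than(num_rows, assumed_min, assumed_max, boundary):
    """Estimate rows satisfying `col > boundary` from an assumed range."""
    if boundary >= assumed_max:
        return 0  # boundary at or above the assumed max: nothing qualifies
    if boundary < assumed_min:
        return num_rows
    frac = (assumed_max - boundary) / (assumed_max - assumed_min)
    return round(num_rows * frac)

# Buggy behavior: a BIGINT column with no stats falls back to an INT-sized
# default range, so `col > Integer.MAX_VALUE` always estimates 0 rows.
print(estimate_greater_than(1000, 0, INT_MAX, INT_MAX))    # 0
# With a BIGINT-sized default range the estimate stays non-zero.
print(estimate_greater_than(1000, 0, 2**63 - 1, INT_MAX))  # 1000
```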

Does this PR introduce any user-facing change?

No

Is the change a dependency upgrade?

No

How was this patch tested?

barking-code avatar Feb 07 '25 06:02 barking-code

@barking-code Thanks for locating the issue and providing the fix. Could you update the output of the failing test cases? Please feel free to ask me if you have any questions. https://ci.hive.apache.org/blue/organizations/jenkins/hive-precommit/detail/PR-5637/1/pipeline/580/

okumin avatar Feb 08 '25 05:02 okumin

> @barking-code Thanks for locating the issue and providing the fix. Could you update the output of the failing test cases? Please feel free to ask me if you have any questions. https://ci.hive.apache.org/blue/organizations/jenkins/hive-precommit/detail/PR-5637/1/pipeline/580/

@okumin Hello, okumin. I can't access this page; it displays a 404 error, but I think it is an authentication or authorization issue. Do you know how I can access it? And do I need to add tests to pass the CI?

barking-code avatar Feb 10 '25 08:02 barking-code

@barking-code, first log in to the CI (top right-hand side) before directly opening the URL mentioned in the comments above. From the CI, you can check which tests are failing and update/fix them accordingly.

If you want to overwrite the q.out files, you can use a command like this:

mvn test -Dtest=<DriverName> -Dqfile=<test_name>.q -Dtest.output.overwrite

For example:

mvn test -Dtest=TestMiniLlapLocalCliDriver -Dqfile=msck_repair_8.q -Drat.skip -Dtest.output.overwrite

Aggarwal-Raghav avatar Feb 10 '25 16:02 Aggarwal-Raghav

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Feel free to reach out on the [email protected] list if the patch is in need of reviews.

github-actions[bot] avatar Apr 13 '25 00:04 github-actions[bot]

@barking-code Do you have a chance to update this branch? You can also take a look at a new document on how to run or update test cases. https://hive.apache.org/development/qtest/

okumin avatar Apr 14 '25 02:04 okumin

Changes to the INT_TYPE range cause execution problems: the reducer waits for input rows, but nothing is being sent.

deniskuzZ avatar Jun 22 '25 08:06 deniskuzZ

I am taking it over since there have been no updates on the PR; the fix is incomplete and breaks the code.

deniskuzZ avatar Jun 22 '25 15:06 deniskuzZ

@kasakrisz, @okumin please take a look

deniskuzZ avatar Jun 23 '25 06:06 deniskuzZ

@barking-code, @deniskuzZ

set hive.support.concurrency=true;
set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;

create table t1 (a bigint) stored as orc TBLPROPERTIES ('transactional'='true');

insert into t1 values
(1),
(1704034800001),
(1704034800001),
(1704034800001),
(1704034800001),
(1704034800001),
(1704034800001),
(1704034800001),
(1704034800001);

delete from t1 where a = 1; -- makes the column stats stale

set hive.fetch.task.conversion=none;

explain
select * from t1 where a >= 1704034800000;

explain
select * from t1 where a >= 10;

Partial plan of the first query:

select * from t1 where a >= 1704034800000

                TableScan
                  alias: t1
                  filterExpr: (a >= 1704034800000L) (type: boolean)
                  Statistics: Num rows: 8 Data size: 64 Basic stats: COMPLETE Column stats: COMPLETE
                  Filter Operator
                    predicate: (a >= 1704034800000L) (type: boolean)
                    Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE

Based on the stats estimations, the TS reads 8 rows and the filter forwards only 1 row. I ran the query both on Apache master and with this patch applied, but got the same plan and row-count estimates.
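The 8-to-1 reduction above is consistent with a simple uniform-distribution estimate over the stale range [1, 1704034800001]. The following sketch is an assumption-labeled model (not Hive's actual StatsRulesProcFactory code) that reproduces both plans' row counts:

```python
# Toy uniform-distribution estimate for `col >= boundary`, with the common
# floor of at least 1 row for a satisfiable predicate (assumption, not
# Hive's exact formula).

def estimate_gte(num_rows, col_min, col_max, boundary):
    if boundary > col_max:
        return 0
    if boundary <= col_min:
        return num_rows
    frac = (col_max - boundary + 1) / (col_max - col_min + 1)
    return max(1, round(num_rows * frac))

# Stale stats after the delete: min=1, max=1704034800001, 8 rows.
print(estimate_gte(8, 1, 1704034800001, 1704034800000))  # 1
print(estimate_gte(8, 1, 1704034800001, 10))             # 8
```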

Partial plan of the second query:

select * from t1 where a >= 10

                TableScan
                  alias: t1
                  filterExpr: (a >= 10L) (type: boolean)
                  Statistics: Num rows: 8 Data size: 64 Basic stats: COMPLETE Column stats: COMPLETE
                  Filter Operator
                    predicate: (a >= 10L) (type: boolean)
                    Statistics: Num rows: 8 Data size: 64 Basic stats: COMPLETE Column stats: COMPLETE

This time the filter forwards 8 rows, both on Apache master and with this patch.

Please provide the repro steps to move this forward. Thank you.

kasakrisz avatar Jun 23 '25 13:06 kasakrisz

@kasakrisz, please see https://github.com/apache/hive/blob/master/ql/src/test/queries/clientpositive/explainanalyze_5.q. Due to type overflow (casting long to int), we estimate the wrong number of rows, i.e. 1.

I originally added the following test case, but explainanalyze_5.q replicates the issue as well: https://github.com/apache/hive/pull/5637/files#diff-b48e5b1dc71b6ee41016e5bd56ba9e725d3e52f362aeb82de939e41de30cc753R477

set hive.stats.column.autogather=false;
set hive.stats.estimate=true;
set hive.stats.fetch.column.stats=true;

set hive.support.concurrency=true;
set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;

create table dot_n0(cint INT, cbigint BIGINT) stored as orc TBLPROPERTIES ('transactional'='true');

insert into table dot_n0 values
    (1,1), (2,2), (3,3), (4,4), (5,5), (-1070551680, 1704034800001), (-1070551681, 1704034800002);

explain analyze delete from dot_n0 where cint < -1070551679;
explain analyze delete from dot_n0 where cbigint >= 1704034800000;
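The "type overflow (cast long to int)" can be illustrated in isolation. As a stand-in for Java's narrowing `(int)` cast, here is a small sketch of signed 32-bit wraparound; the specific wrapped value is just illustrative arithmetic, not taken from the Hive code:

```python
def to_int32(x):
    """Emulate Java's narrowing (int) cast: wrap to signed 32-bit."""
    return ((x + 2**31) % 2**32) - 2**31

# A BIGINT boundary far above Integer.MAX_VALUE wraps to a negative int,
# which corrupts any range comparison done in int arithmetic.
print(to_int32(1704034800001))  # -1067216511
print(to_int32(2**31 - 1))      # 2147483647 (fits, no wrap)
```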

deniskuzZ avatar Jun 23 '25 14:06 deniskuzZ

Do we have test coverage for all the affected datatypes (int, long, bigint, decimal, date, timestamp)?

kasakrisz avatar Jun 24 '25 09:06 kasakrisz

> Do we have test coverage for all the affected datatypes (int, long, bigint, decimal, date, timestamp)?

Added additional tests.

deniskuzZ avatar Jun 24 '25 15:06 deniskuzZ

@okumin, @kasakrisz, code freeze is today. Could we please move forward with the fix?

deniskuzZ avatar Jun 25 '25 08:06 deniskuzZ

> Do we have test coverage for all the affected datatypes (int, long, bigint, decimal, date, timestamp)?

I haven't changed date & timestamp, only int, bigint (represented as long) and decimal (see the TPC-DS tests; more accurate row count estimation). The existing q tests cover the intent of this PR.

The reason why I ask is that, for example, timestamp had

            long maxValue = cs.getRange().maxValue.longValue();
            long minValue = cs.getRange().minValue.longValue();

So the minValue was also long; in this patch it is int.

kasakrisz avatar Jun 25 '25 12:06 kasakrisz

> Do we have test coverage for all the affected datatypes (int, long, bigint, decimal, date, timestamp)?
>
> I haven't changed date & timestamp, only int, bigint (represented as long) and decimal (see the TPC-DS tests; more accurate row count estimation). The existing q tests cover the intent of this PR.
>
> The reason why I ask is that, for example, timestamp had
>
>             long maxValue = cs.getRange().maxValue.longValue();
>             long minValue = cs.getRange().minValue.longValue();
>
> So the minValue was also long; in this patch it is int.

fixed

deniskuzZ avatar Jun 25 '25 14:06 deniskuzZ

Not chasing this in the 4.1 release anymore.

deniskuzZ avatar Jun 25 '25 19:06 deniskuzZ

@kasakrisz, have I covered all your concerns? Let me know if there are any additional tests you'd like to include: https://github.com/apache/hive/blob/ecd3dfab281e647fceed18db1ed5ebe748822908/ql/src/test/queries/clientpositive/explainanalyze_6.q

FYI, the latest commit also updated the date and timestamp ranges to future values: https://github.com/apache/hive/pull/5637/files#diff-1b0e96106a06cddf1ca7e77f3b4cf45d849d86db4766fa7a8e9e945567f41cf3R147

deniskuzZ avatar Jun 26 '25 11:06 deniskuzZ