
HIVE-27725: Remove redundant columns in TAB_COL_STATS and PART_COL_STATS tables

Open wecharyu opened this issue 1 year ago • 31 comments

What changes were proposed in this pull request?

  1. Remove the CAT_NAME, DB_NAME, and TABLE_NAME columns from the TAB_COL_STATS table.
  2. Remove the CAT_NAME, DB_NAME, TABLE_NAME, and PARTITION_NAME columns from the PART_COL_STATS table (see the DDL sketch below).
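
For reference, a minimal sketch of what the corresponding MySQL upgrade step could look like, assuming plain DROP COLUMN statements (the actual upgrade script in the patch may differ):

-- Hypothetical sketch; not the exact script shipped with this PR.
ALTER TABLE TAB_COL_STATS
  DROP COLUMN CAT_NAME,
  DROP COLUMN DB_NAME,
  DROP COLUMN TABLE_NAME;

ALTER TABLE PART_COL_STATS
  DROP COLUMN CAT_NAME,
  DROP COLUMN DB_NAME,
  DROP COLUMN TABLE_NAME,
  DROP COLUMN PARTITION_NAME;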

Why are the changes needed?

  1. Refine the metadata schema by removing the redundant columns from TAB_COL_STATS and PART_COL_STATS.
  2. Unify getting, updating, and deleting table stats, and make sure all queries use the join clause to avoid data inconsistency (see the sketch below).
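
To illustrate the second point, here is a hedged sketch of the partition-stats lookup before and after this change; the join form follows the EXPLAIN plan shown later in this thread, while the exact SQL that HMS generates may differ, and the selected stats columns are only examples:

-- Before: filter on the denormalized name columns stored in PART_COL_STATS,
-- which can drift out of sync with DBS/TBLS/PARTITIONS after renames.
SELECT COLUMN_NAME, LONG_LOW_VALUE, LONG_HIGH_VALUE
  FROM PART_COL_STATS
 WHERE CAT_NAME = 'hive' AND DB_NAME = 'db0'
   AND TABLE_NAME = 'tbl0' AND PARTITION_NAME = 'p=part_val0';

-- After: resolve the partition through id-based joins, so the stats rows
-- can never disagree with the current names.
SELECT s.COLUMN_NAME, s.LONG_LOW_VALUE, s.LONG_HIGH_VALUE
  FROM PART_COL_STATS s
  JOIN PARTITIONS p ON p.PART_ID = s.PART_ID
  JOIN TBLS t ON t.TBL_ID = p.TBL_ID
  JOIN DBS d ON d.DB_ID = t.DB_ID
 WHERE d.CTLG_NAME = 'hive' AND d.NAME = 'db0'
   AND t.TBL_NAME = 'tbl0' AND p.PART_NAME = 'p=part_val0';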

Does this PR introduce any user-facing change?

No.

Is the change a dependency upgrade?

No.

How was this patch tested?

Add a new unit test.

wecharyu avatar Sep 24 '23 15:09 wecharyu

@saihemanth-cloudera @deniskuzZ @dengzhhu653: Could you help review this PR?

wecharyu avatar Sep 26 '23 15:09 wecharyu

Kudos, SonarCloud Quality Gate passed!

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 22 Code Smells

No Coverage information
No Duplication information

Warning: The version of Java (11.0.8) you have used to run this analysis is deprecated and we will stop accepting it soon. Please update to at least Java 17. Read more here

sonarqubecloud[bot] avatar Oct 06 '23 07:10 sonarqubecloud[bot]

Kudos, SonarCloud Quality Gate passed!

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 22 Code Smells

No Coverage information
No Duplication information

Warning: The version of Java (11.0.8) you have used to run this analysis is deprecated and we will stop accepting it soon. Please update to at least Java 17. Read more here

sonarqubecloud[bot] avatar Nov 13 '23 14:11 sonarqubecloud[bot]

Need to rerun the CI to get a green pass.

zhangbutao avatar Dec 19 '23 12:12 zhangbutao

Currently DirectSqlUpdateStat is used only in the ETL flow; can we use the directSql stats update in HiveAlterHandler (rename table/partition/col, drop col) as well?

nareshpr avatar Dec 22 '23 02:12 nareshpr

Please check whether all these columns can be removed from hive_metastore.thrift ColumnStatisticsDesc

nareshpr avatar Dec 22 '23 03:12 nareshpr

Currently DirectSqlUpdateStat is used only in the ETL flow; can we use the directSql stats update in HiveAlterHandler (rename table/partition/col, drop col) as well?

Yes, we can make some improvements here; for example, renaming a table/partition no longer needs to update the column statistics.
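
As a hedged illustration (assuming MySQL and the pre-patch column names), this is the kind of rename-time statement that becomes unnecessary once the name columns are gone:

-- Old schema: renaming db0.tbl0 to db0.tbl1 also had to rewrite every
-- affected stats row to keep the denormalized TABLE_NAME in sync.
UPDATE PART_COL_STATS SET TABLE_NAME = 'tbl1'
 WHERE CAT_NAME = 'hive' AND DB_NAME = 'db0' AND TABLE_NAME = 'tbl0';

-- New schema: stats rows reference PART_ID only, so a rename touches just
-- TBLS/PARTITIONS and no stats rewrite is needed.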

Please check whether all these columns can be removed from hive_metastore.thrift ColumnStatisticsDesc

We cannot remove these columns from ColumnStatisticsDesc, because:

  1. it is the API data struct, and changing the struct would break API compatibility.
  2. these columns are meaningful for describing the table or partition metadata.

wecharyu avatar Dec 25 '23 15:12 wecharyu

Applying this change, Hive may suffer an intermittent outage when CBO is on (HiveServer2 connects to the old HMS but with the schema change). How about leaving the redundant columns in the database but having HMS not use these columns at all (using the multi-join proposed in this PR)?

dengzhhu653 avatar Jan 02 '24 10:01 dengzhhu653

@dengzhhu653 Not sure if I understand correctly.

Applying this change, Hive may suffer an intermittent outage when CBO is on (HiveServer2 connects to the old HMS but with the schema change).

Do you mean that the upgrade-4.0.0-beta-1-to-4.0.0-beta-2.mysql.sql schema is upgraded but the HMS is not? I think yes: HMS will encounter errors when fetching the table & partition stats, as the direct SQL generated by the old HMS can no longer query the underlying metadata in MySQL. But I think it should not be a problem; users should know that if they upgrade the metadata schema, they should also upgrade the HMS binary package at the same time.

How about leaving the redundant columns in the database but having HMS not use these columns at all (using the multi-join proposed in this PR)?

IMO, it's not a good idea to leave the redundant columns just to keep older HMS stats working. Now that you've upgraded the metadata schema, why wouldn't you upgrade the HMS binary package too? I can't think of a reason not to.

Also, it seems we never promised that an old HMS works well with a newer, upgraded metadata schema. Almost any change to the metadata schema can break an old HMS, not just this PR.

zhangbutao avatar Jan 03 '24 16:01 zhangbutao

@zhangbutao

But I think it should not be a problem; users should know that if they upgrade the metadata schema, they should also upgrade the HMS binary package at the same time.

In my opinion, the schema upgrade and the binary upgrade can't happen at the same time, especially if there are multiple different HMS instances and clusters. For this change we drop columns from the old table, so for safety we need to back up the table first before doing any operation, and avoid the old HMS writing into this table during the upgrade; you cannot tell users not to roll back to the old version if they find something wrong.
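
As an aside, a minimal MySQL backup sketch before the destructive step (the backup table names here are hypothetical, and this is not part of this PR) could be:

-- Hypothetical backup step; copies rows only, not indexes or constraints.
CREATE TABLE PART_COL_STATS_BACKUP AS SELECT * FROM PART_COL_STATS;
CREATE TABLE TAB_COL_STATS_BACKUP AS SELECT * FROM TAB_COL_STATS;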

Almost any change to the metadata schema can break an old HMS, not just this PR.

That's possible, so we need to reduce the chance of such things recurring, and especially make it easy for users to recover from failure.

dengzhhu653 avatar Jan 04 '24 13:01 dengzhhu653

Hi @dengzhhu653, I agree with @zhangbutao that we should also clean up the schema in this patch. If we wanted a fully smooth upgrade for this feature, we would need to do the following:

  1. Stage 1 (new_version1): read stats by joining on ids but still write all the redundant names (just as you suggest); the old version reads by names.
  2. Stage 2 (new_version2): write without the redundant names; both new_version2 and new_version1 read by joining on ids.
  3. Stage 3 (new_version3): remove the redundant columns from the schema.

That would take three patches (versions) to land this feature, which seems a bit tedious. If we instead combine all the changes in this patch, upgrade the HMS clusters one by one, and finally change the schema, then the only issue is that the old HMS instances cannot read the new stats created by the new HMS instances, which I think is acceptable. Furthermore, we can take all the old instances offline during the upgrade to avoid this issue.

wecharyu avatar Jan 04 '24 15:01 wecharyu

Can you fix the failed tests? http://ci.hive.apache.org/blue/organizations/jenkins/hive-precommit/detail/PR-4744/18/tests

zhangbutao avatar Jan 11 '24 07:01 zhangbutao

@zhangbutao CI is green.

wecharyu avatar Jan 13 '24 11:01 wecharyu

There are some discussions about the performance regression after deprecating the redundant columns: https://issues.apache.org/jira/browse/HIVE-11786

dengzhhu653 avatar Jan 16 '24 07:01 dengzhhu653

Performance regression is what we need to be concerned about. @wecharyu has done some benchmarks to validate that performance is OK after this change, but the benchmark ran on the local HMS benchmark tool; maybe this cannot reflect actual production use cases? It would be better to do a more extreme benchmark test, as HIVE-11786 said, with 2 million+ rows in PART_COL_STATS. @wecharyu, do you have any actual production use case (big partitions with PART_COL_STATS) showing no performance regression for this change?

zhangbutao avatar Jan 16 '24 07:01 zhangbutao

@dengzhhu653 @zhangbutao I made another benchmark test with 2 million partition stats written in advance; it shows no obvious performance regression:

mysql> select count(1) from PART_COL_STATS;
+----------+
| count(1) |
+----------+
|  2000000 |
+----------+
1 row in set (0.40 sec)

benchmark test:

java -jar ./hmsbench-jar-with-dependencies.jar -H localhost --savedata /tmp/benchdata --sanitize -N 100 -N 1000 -o bench_results_direct.csv -C -d testbench_http --params=100  -E 'drop.*' -E 'renameTable.*' -E 'getTableObjectsByName.*' -E 'listTables.*' -E 'listPartitions.*' -E 'getPartitionsByNames.*' -E 'getPartitionNames.*' -E 'listPartition' -E 'getPartition' -E 'getPartitions' -E 'getPartitions.10' -E 'getPartitions.100' -E 'getPartitions.1000' -E 'addPartition.*' -E 'addPartitions.*' -E 'alterPartitions.*' -E 'getNid' -E 'listDatabases' -E 'getTable' -E 'createTable' -E 'openTxn.*'
  • before this patch
Operation       Mean    Med     Min     Max     Err%
getPartitionsStat       5.21167 5.16801 4.92140 6.05965 3.77022
getPartitionsStat.100   6.93186 6.83728 6.48675 10.2091 6.80759
getPartitionsStat.1000  15.1901 14.8172 14.3164 19.6772 6.61940
updatePartitionsStat    9.83066 9.63766 9.27253 16.3278 9.28177
updatePartitionsStat.100        1009.46 1009.26 991.282 1052.16 0.956140
updatePartitionsStat.1000       10091.7 10088.1 9929.50 10309.3 0.760790
  • after this patch
Operation       Mean    Med     Min     Max     Err%
getPartitionsStat       5.56409 5.49373 5.20583 7.02619 5.03727
getPartitionsStat.100   6.34526 6.29966 5.97725 7.85943 4.11913
getPartitionsStat.1000  14.2403 14.1247 13.6040 15.8745 3.02256
updatePartitionsStat    10.5586 10.3743 9.88599 14.8948 7.01613
updatePartitionsStat.100        1013.06 1011.71 978.329 1047.57 1.45127
updatePartitionsStat.1000       9912.52 9905.62 9677.24 10163.9 1.22903

wecharyu avatar Jan 20 '24 03:01 wecharyu

@dengzhhu653 @zhangbutao I made another benchmark test with 2 million partition stats written in advance; it shows no obvious performance regression: […]

From what I see in benchmarkGetPartitionsStat, it looks like there is only one table with thousands of partitions and col stats; am I missing something? I guess a performance regression would come from the multiple joins after removing the columns. How many databases, tables, and partitions are in the bench test?

Thanks, Zhihua

dengzhhu653 avatar Jan 21 '24 01:01 dengzhhu653

From what I see in benchmarkGetPartitionsStat, it looks like there is only one table with thousands of partitions and col stats; am I missing something?

2 million partition column stats were added to MySQL before the benchmark test. benchmarkGetPartitionsStat still only queries the count given by the -N options.

I guess a performance regression would come from the multiple joins after removing the columns. How many databases, tables, and partitions are in the bench test?

The 2 million partition column stats are: 10 dbs * 10 tbls per db * 200 partitions per tbl * 100 column stats per partition (10 * 10 * 200 * 100 = 2,000,000).

BTW, all joins across the DBS, TBLS, and PARTITIONS tables hit indexes, so I don't think these joins would cause a performance regression:

mysql> explain select * from PART_COL_STATS join PARTITIONS on PARTITIONS.PART_ID=PART_COL_STATS.PART_ID join TBLS on TBLS.TBL_ID=PARTITIONS.TBL_ID join DBS on DBS.DB_ID=TBLS.DB_ID where DBS.NAME='db0' and DBS.CTLG_NAME='hive' and
 TBLS.TBL_NAME='tbl0' and PARTITIONS.PART_NAME='p=part_val0' and PART_COL_STATS.engine='hive' and PART_COL_STATS.COLUMN_NAME='id0';
+----+-------------+----------------+------------+-------+----------------------------------+-----------------+---------+-------------+------+----------+-------------+
| id | select_type | table          | partitions | type  | possible_keys                    | key             | key_len | ref         | rows | filtered | Extra       |
+----+-------------+----------------+------------+-------+----------------------------------+-----------------+---------+-------------+------+----------+-------------+
|  1 | SIMPLE      | DBS            | NULL       | const | PRIMARY,UNIQUE_DATABASE,CTLG_FK1 | UNIQUE_DATABASE | 389     | const,const |    1 |   100.00 | NULL        |
|  1 | SIMPLE      | TBLS           | NULL       | const | PRIMARY,UNIQUETABLE,TBLS_N49     | UNIQUETABLE     | 268     | const,const |    1 |   100.00 | NULL        |
|  1 | SIMPLE      | PARTITIONS     | NULL       | const | PRIMARY,UNIQUEPARTITION          | UNIQUEPARTITION | 779     | const,const |    1 |   100.00 | NULL        |
|  1 | SIMPLE      | PART_COL_STATS | NULL       | ref   | PCS_STATS_IDX                    | PCS_STATS_IDX   | 777     | const,const |    1 |    10.00 | Using where |
+----+-------------+----------------+------------+-------+----------------------------------+-----------------+---------+-------------+------+----------+-------------+
4 rows in set, 1 warning (0.00 sec)

wecharyu avatar Jan 21 '24 09:01 wecharyu

The 2 million partition column stats are: 10 dbs * 10 tbls per db * 200 partitions per tbl * 100 column stats per partition […]

In production, I've seen an average of 100 tbls * 1000 parts * 100 cols.

deniskuzZ avatar Jan 22 '24 19:01 deniskuzZ

Two consecutive builds fail with the same thing: TestMiniLlapLocalCliDriver.testCliDriver[partition_explain_ddl]. @wecharyu, should we regenerate the q file and check the diff?

deniskuzZ avatar Jan 25 '24 15:01 deniskuzZ

@deniskuzZ The partition_explain_ddl.q qtest seems flaky. I can pass this test using the CI docker image, but I'm not sure why it failed in the CI. On the other hand, this test failed on my local machine with a different JDK (1.8.0_392-b08).

The difference I found locally is that it generates different bitVectors in the temp col stat file, but I'm not sure why. It seems unrelated to this patch, because getting or updating part_col_stats in HMS only takes effect after the temp stat file is generated. Is it reasonable to disable this test for now?

wecharyu avatar Jan 29 '24 16:01 wecharyu

@deniskuzZ The partition_explain_ddl.q qtest seems flaky. […] Is it reasonable to disable this test for now?

let's see results of flaky check first: http://ci.hive.apache.org/job/hive-flaky-check/807/

deniskuzZ avatar Feb 01 '24 12:02 deniskuzZ

@deniskuzZ The partition_explain_ddl.q qtest seems flaky. […] Is it reasonable to disable this test for now?

let's see results of flaky check first: http://ci.hive.apache.org/job/hive-flaky-check/807/

@wecharyu, the flaky run reported green, so it looks like this PR introduced the problem.

deniskuzZ avatar Feb 02 '24 11:02 deniskuzZ

@wecharyu, the flaky run reported green, so it looks like this PR introduced the problem.

@deniskuzZ I could not reproduce this test failure in docker :( Here is a test result:

[INFO] --- maven-compiler-plugin:3.8.1:testCompile (default-testCompile) @ hive-it-qfile ---
[INFO] Nothing to compile - all classes are up to date
[INFO] 
[INFO] --- maven-surefire-plugin:3.0.0-M4:test (default-test) @ hive-it-qfile ---
[INFO] 
[INFO] -------------------------------------------------------
[INFO]  T E S T S
[INFO] -------------------------------------------------------
[INFO] Running org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver
[INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 34.144 s - in org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver
[INFO] 
[INFO] Results:
[INFO] 
[INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0
[INFO] 
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  51.450 s
[INFO] Finished at: 2024-02-02T17:50:19Z
[INFO] ------------------------------------------------------------------------
jenkins@ip-10-169-5-104:~/hive/itests/qtest$ sudo mvn test -Dtest=org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver -Dqfile=partition_explain_ddl.q

Could you also help run this test locally and in docker, on both this patch and master, to see if there are any failures?

wecharyu avatar Feb 02 '24 17:02 wecharyu

started flaky check on your branch: http://ci.hive.apache.org/job/hive-flaky-check/808/

deniskuzZ avatar Feb 05 '24 15:02 deniskuzZ

It seems there's no failure... http://ci.hive.apache.org/job/hive-flaky-check/808/testReport/org.apache.hadoop.hive.cli/TestMiniLlapLocalCliDriver/

wecharyu avatar Feb 06 '24 02:02 wecharyu

It seems there's no failure... http://ci.hive.apache.org/job/hive-flaky-check/808/testReport/org.apache.hadoop.hive.cli/TestMiniLlapLocalCliDriver/

Yeah, then probably some other test affects it

deniskuzZ avatar Feb 06 '24 15:02 deniskuzZ

There's some API change:

HMSBenchmarks.java:450: error: no suitable constructor found for PartitionsStatsRequest(String,String,List<String>,List<String>,String)

deniskuzZ avatar Feb 07 '24 14:02 deniskuzZ

Restarted; some unrelated tests failed. Sonar reported 20 new issues; are those new or just code movement?

deniskuzZ avatar Feb 20 '24 15:02 deniskuzZ

No new code; the Sonar-reported issues were pre-existing.

wecharyu avatar Feb 21 '24 03:02 wecharyu