hive
hive copied to clipboard
HIVE-28062: Optimize get_partitions_by_names in direct sql
What changes were proposed in this pull request?
- Add partition number check in
get_partitions_by_names
- Skipping get partition ids in
MetaStoreDirectSql#getPartitionsViaPartNames
to reduce the Database calls.
Why are the changes needed?
-
get_partitions_by_names
also need check partition number in case the request number is too large - remove unnecessary db calls can improve the performance.
Does this PR introduce any user-facing change?
No.
Is the change a dependency upgrade?
No.
How was this patch tested?
- Passing all existing tests.
- Run the
getPartitionsByNames
benchmark tests:
java -jar ./hmsbench-jar-with-dependencies.jar -H localhost --savedata /tmp/benchdata --sanitize -N 100 -N 10000 -o bench_results_direct.csv -C -d testbench_http --params=100 -E 'drop.*' -E 'renameTable.*' -E 'getTableObjectsByName.*' -E 'listTables.*' -E 'listPartitions.*' -E 'getPartitionNames.*' -E 'listPartition' -E 'getPartition' -E 'getPartitions' -E 'getPartitions.100' -E 'getPartitions.10000' -E 'getPartitionsByFilter.*' -E 'addPartition.*' -E 'addPartitions.*' -E 'alterPartitions.*' -E 'getNid' -E 'listDatabases' -E 'getTable' -E 'createTable' -E 'openTxn.*'
- Before this patch:
Operation Mean Med Min Max Err%
getPartitionsByNames 7.47374 7.31857 6.96766 10.2396 7.39022
getPartitionsByNames.100 14.4165 14.2468 13.6072 17.9526 4.67421
getPartitionsByNames.10000 524.812 524.127 504.923 555.049 2.02143
- After this patch:
Operation Mean Med Min Max Err%
getPartitionsByNames 7.11728 7.02464 6.78734 8.59082 4.32347
getPartitionsByNames.100 13.4596 13.3253 12.8197 16.3828 3.77006
getPartitionsByNames.10000 460.498 459.805 438.384 500.921 2.57908
The failed tests seem to be related. http://ci.hive.apache.org/blue/organizations/jenkins/hive-precommit/detail/PR-5063/1/tests
The failed tests seem to be related. http://ci.hive.apache.org/blue/organizations/jenkins/hive-precommit/detail/PR-5063/1/tests
@zhangbutao The issue arises due to the introduction of a partition number check in the getPartitionsByNames()
API, fixed it in the client side.
@zhangbutao could you take a look again?
- @dengzhhu653 @deniskuzZ @ayushtkn @saihemanth-cloudera may be interested in this improvement. :)
Does this PR introduce any user-facing change? No
We didn't have a limit before, didn't we? Right now user might get an exception when the limit is exceeded
http://ci.hive.apache.org/blue/organizations/jenkins/hive-precommit/detail/PR-5063/13/tests/
org.apache.iceberg.mr.hive.TestConflictingDataFiles::testMultiFiltersUpdate
I think this is a flaky test. I have seen this faliure many times.
Run a flaky check to see results. http://ci.hive.apache.org/job/hive-flaky-check/831/
http://ci.hive.apache.org/blue/organizations/jenkins/hive-precommit/detail/PR-5063/13/tests/
org.apache.iceberg.mr.hive.TestConflictingDataFiles::testMultiFiltersUpdate
I think this is a flaky test. I have seen this faliure many times. Run a flaky check to see results. http://ci.hive.apache.org/job/hive-flaky-check/831/
http://ci.hive.apache.org/job/hive-flaky-check/831/testReport/ test report shows that this is a flay test. Hi @simhadri-g , do you have any suggestion about this flaky test, as i see you wrote this code not long ago. Should we disable it now? I filed a ticket about the flay test: https://issues.apache.org/jira/browse/HIVE-28153
Thanks.
Quality Gate passed
Issues
14 New issues
0 Accepted issues
Measures
0 Security Hotspots
No data about Coverage
No data about Duplication
Hi @zhangbutao , TestConflictingDataFiles is a concurrency test. It seems to be failing only with format version 1. I am working on checking why its flaky. Thanks!