HIVE-26221: Add histogram-based column statistics

Open asolimando opened this issue 2 years ago • 2 comments

What changes were proposed in this pull request?

See the Jira ticket.

Why are the changes needed?

See the Jira ticket.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

mvn test -Dtest=TestMiniLlapLocalCliDriver -Dqfile="compute_kll_sketch.q,sketches_kll_test.q,stats_histogram.q,stats_histogram2.q" -Dtest.output.overwrite -pl itests/qtest -Pitests

mvn test -Dtest=TestMiniLlapLocalCliDriver -Dqfile="sketches_rewrite_rank_partition_by.q,sketches_rewrite_rank.q,sketches_rewrite_percentile_disc.q,sketches_rewrite_ntile_partition_by.q,sketches_rewrite_ntile.q,sketches_rewrite_cume_dist_partition_by.q,sketches_rewrite_cume_dist.q,sketches_materialized_view_rank.q,sketches_materialized_view_percentile_disc.q,sketches_materialized_view_ntile.q,sketches_materialized_view_cume_dist.q" -Dtest.output.overwrite -pl itests/qtest -Pitests

mvn test -Dtest=LongColumnStatsAggregatorTest -pl standalone-metastore/metastore-server
mvn test -Dtest.groups=org.apache.hadoop.hive.metastore.annotation.MetastoreCheckinTest -Dtest=TestCachedStoreUpdateUsingEvents.java -pl itests/hive-unit -Pitests
mvn test -Dtest.groups=org.apache.hadoop.hive.metastore.annotation.MetastoreCheckinTest -Dtest=TestPartitionStat -pl standalone-metastore/metastore-server
mvn test -Dtest.groups=org.apache.hadoop.hive.metastore.annotation.MetastoreCheckinTest -Dtest=TestCachedStore -pl standalone-metastore/metastore-server
mvn test -Dtest=TestObjectStore -pl standalone-metastore/metastore-server

asolimando avatar Mar 24 '22 12:03 asolimando

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Feel free to reach out on the [email protected] list if the patch is in need of reviews.

github-actions[bot] avatar Sep 20 '22 00:09 github-actions[bot]

Please keep the PR open

asolimando avatar Sep 20 '22 05:09 asolimando

Kudos, SonarCloud Quality Gate passed!

0 Bugs (rating A)
0 Vulnerabilities (rating A)
0 Security Hotspots (rating A)
36 Code Smells (rating A)

No Coverage information
No Duplication information

sonarqubecloud[bot] avatar Nov 24 '22 23:11 sonarqubecloud[bot]

Thanks @asolimando, I left some minor comments, and the changes look good to me.

dengzhhu653 avatar Dec 08 '22 13:12 dengzhhu653

Another question that I don't see addressed here: how do we generate the histogram statistics? By issuing an "analyze table" command?

dengzhhu653 avatar Dec 08 '22 13:12 dengzhhu653

> Another question that I don't see addressed here: how do we generate the histogram statistics? By issuing an "analyze table" command?

That was hard to figure out for me too at first. Statistics computation happens via an aggregate query, where different UDAFs are used to compute the different statistics.

ColumnStatsSemanticAnalyzer.java#L308-L325 generates the SELECT statement for the stats.

It then calls ColumnStatsSemanticAnalyzer.java#L327, which has an enum with the different statistics. We added a new entry for histograms and generated the code accordingly (see ColumnStatsSemanticAnalyzer.java#L355-L357).

Finally, the UDAF part is generated here: ColumnStatsSemanticAnalyzer.java#L494-L519.
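
To make the flow above more concrete, here is a rough sketch of the idea, not the actual Hive code: an enum lists the statistics, each entry maps to an aggregate expression, and the SELECT statement is assembled from them. All names in it (`StatsQuerySketch`, `StatKind`, `buildStatsQuery`, and in particular `hypothetical_histogram_udaf`) are made up for illustration and are assumptions, not identifiers from ColumnStatsSemanticAnalyzer.java.

```java
import java.util.ArrayList;
import java.util.List;

class StatsQuerySketch {

    // Hypothetical stand-in for the enum of statistics; each entry carries
    // the aggregate expression template used to compute it.
    enum StatKind {
        MIN("min(%s)"),
        MAX("max(%s)"),
        NUM_NULLS("count(1) - count(%s)"),
        // Placeholder name: the real UDAF used for histograms is not shown here.
        HISTOGRAM("hypothetical_histogram_udaf(%s)");

        private final String template;

        StatKind(String template) {
            this.template = template;
        }

        String toExpression(String column) {
            return String.format(template, column);
        }
    }

    // Builds one aggregate query that computes every statistic for a column.
    static String buildStatsQuery(String table, String column) {
        List<String> expressions = new ArrayList<>();
        for (StatKind kind : StatKind.values()) {
            expressions.add(kind.toExpression(column));
        }
        return "SELECT " + String.join(", ", expressions) + " FROM " + table;
    }

    public static void main(String[] args) {
        System.out.println(buildStatsQuery("my_table", "my_col"));
    }
}
```

With this shape, supporting a new statistic amounts to adding one enum entry, which mirrors how the histogram entry was added in the PR.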

asolimando avatar Dec 08 '22 14:12 asolimando

Rebased on master and force-pushed to update the hive_metastore.proto file introduced in https://issues.apache.org/jira/browse/HIVE-26484

asolimando avatar Dec 09 '22 10:12 asolimando

Kudos, SonarCloud Quality Gate passed!

0 Bugs (rating A)
0 Vulnerabilities (rating A)
0 Security Hotspots (rating A)
39 Code Smells (rating A)

No Coverage information
No Duplication information

sonarqubecloud[bot] avatar Dec 13 '22 07:12 sonarqubecloud[bot]