hive
HIVE-26221: Add histogram-based column statistics
What changes were proposed in this pull request?
See the Jira ticket.
Why are the changes needed?
See the Jira ticket.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
mvn test -Dtest=TestMiniLlapLocalCliDriver -Dqfile="compute_kll_sketch.q,sketches_kll_test.q,stats_histogram.q,stats_histogram2.q" -Dtest.output.overwrite -pl itests/qtest -Pitests
mvn test -Dtest=TestMiniLlapLocalCliDriver -Dqfile="sketches_rewrite_rank_partition_by.q,sketches_rewrite_rank.q,sketches_rewrite_percentile_disc.q,sketches_rewrite_ntile_partition_by.q,sketches_rewrite_ntile.q,sketches_rewrite_cume_dist_partition_by.q,sketches_rewrite_cume_dist.q,sketches_materialized_view_rank.q,sketches_materialized_view_percentile_disc.q,sketches_materialized_view_ntile.q,sketches_materialized_view_cume_dist.q" -Dtest.output.overwrite -pl itests/qtest -Pitests
mvn test -Dtest=LongColumnStatsAggregatorTest -pl standalone-metastore/metastore-server
mvn test -Dtest.groups=org.apache.hadoop.hive.metastore.annotation.MetastoreCheckinTest -Dtest=TestCachedStoreUpdateUsingEvents.java -pl itests/hive-unit -Pitests
mvn test -Dtest.groups=org.apache.hadoop.hive.metastore.annotation.MetastoreCheckinTest -Dtest=TestPartitionStat -pl standalone-metastore/metastore-server
mvn test -Dtest.groups=org.apache.hadoop.hive.metastore.annotation.MetastoreCheckinTest -Dtest=TestCachedStore -pl standalone-metastore/metastore-server
mvn test -Dtest=TestObjectStore -pl standalone-metastore/metastore-server
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Feel free to reach out on the Hive dev mailing list if the patch is in need of reviews.
Please keep the PR open
Thanks @asolimando, I left some minor comments, and the changes look good to me.
Another question that I don't see answered here: how do we generate the histogram statistics? By issuing an "analyze table" command?
That was hard for me to figure out at first, too. Statistics computation happens via an aggregate query, where different UDAFs are used to compute the different statistics.
ColumnStatsSemanticAnalyzer.java#L308-L325 generates the SELECT statement for the stats.
It then calls ColumnStatsSemanticAnalyzer.java#L327, which has an enum with the different statistics; we added a new entry for histograms and generated the code accordingly (see ColumnStatsSemanticAnalyzer.java#L355-L357).
Finally, the UDAF part is generated here: ColumnStatsSemanticAnalyzer.java#L494-L519.
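To make the flow above easier to picture, here is a minimal sketch of the pattern: an enum of statistics drives the generation of a single aggregate SELECT, and supporting a new statistic amounts to adding an enum entry with its aggregate expression. This is not the actual Hive code. The class `StatsQuerySketch`, the enum `Stat`, the method `buildStatsQuery`, and the function name `histogram_udaf` are all hypothetical placeholders chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: every name here is hypothetical, not Hive's actual classes or UDAFs.
class StatsQuerySketch {

    // Each statistic carries the aggregate expression template that computes it.
    enum Stat {
        MIN("min(%s)"),
        MAX("max(%s)"),
        NUM_NULLS("count(1) - count(%s)"),
        HISTOGRAM("histogram_udaf(%s)"); // placeholder for the newly added histogram entry

        private final String template;

        Stat(String template) {
            this.template = template;
        }

        // Fills the column name into the aggregate expression template.
        String expressionFor(String column) {
            return String.format(template, column);
        }
    }

    // Builds one aggregate SELECT covering every statistic for every column.
    static String buildStatsQuery(String table, List<String> columns) {
        List<String> expressions = new ArrayList<>();
        for (String column : columns) {
            for (Stat stat : Stat.values()) {
                expressions.add(stat.expressionFor(column));
            }
        }
        return "SELECT " + String.join(", ", expressions) + " FROM " + table;
    }

    public static void main(String[] args) {
        // Prints: SELECT min(a), max(a), count(1) - count(a), histogram_udaf(a) FROM t
        System.out.println(buildStatsQuery("t", Arrays.asList("a")));
    }
}
```

In this sketch the query builder never needs to change when a statistic is added, which is roughly why extending the enum was the natural place to hook in histograms.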
Rebased on master and force-pushed to update the hive_metastore.proto file introduced in https://issues.apache.org/jira/browse/HIVE-26484.