hive icon indicating copy to clipboard operation
hive copied to clipboard

HIVE-26243 ds kll sketch vectorized

Open asolimando opened this issue 2 years ago • 5 comments

What changes were proposed in this pull request?

We add a vectorized implementation for the ds_kll_sketch UDAF

Why are the changes needed?

When this UDAF is used either alone or at the side of other vectorizable functions, it will benefit from a performance speed-up.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

mvn test -Dtest=TestMiniLlapLocalCliDriver -Dqfile="compute_kll_sketch.q" -Dtest.output.overwrite -pl itests/qtest -Pitests

and

mvn test -Dtest=TestMiniLlapLocalCliDriver -Dqfile="sketches_rewrite_rank_partition_by.q,sketches_rewrite_rank.q,sketches_rewrite_percentile_disc.q,sketches_rewrite_ntile_partition_by.q,sketches_rewrite_ntile.q,sketches_rewrite_cume_dist_partition_by.q,sketches_rewrite_cume_dist.q,sketches_materialized_view_rank.q,sketches_materialized_view_percentile_disc.q,sketches_materialized_view_ntile.q,sketches_materialized_view_cume_dist.q" -Dtest.output.overwrite -pl itests/qtest -Pitests

asolimando avatar May 23 '22 17:05 asolimando

Before going into the single discussions, the general answer to all the above comments boils down to "I am trying to keep consistency with what was done here for vectorizing HyperLogLog function": https://github.com/apache/hive/pull/1824/files

I sense that you don't like how that PR was designed, but since they are very close in spirit, and that their code is used side by side, I thought it was important to keep them consistent.

If we need to rework the current PR, they won't match anymore, unless we rework the HLL design and implementation too, and this has its own share of cons...

Assuming we go for the refactoring, most of the comments are too sketchy to give appropriate guidance over an alternative design/implementation, I will need to ask you to elaborate more on them.

For instance, you seem to be suggesting to remove all helper classes/methods etc. Since it does not seem feasible to inline all the code now sitting in the helper methods/classes directly in the vectorized implementation, I guess you want to place it someplace else, but I can't really decide based on your comment.

For the couple of currently unused methods, I will need them in a PR depending on this one: https://issues.apache.org/jira/browse/HIVE-26221: I can remove them now and re-introduce them later, if preferable. Once again they mimic HLL methods (both naming and usage, since HLL and KLL methods will be used side by side in most places, it helps reading what's happening, see LongColumnStatsAggregator.java#L104-L111, for instance).

asolimando avatar May 31 '22 12:05 asolimando

PR-1824 has nothing to do with datasketches; I don't know how you followed it's conventions but you might end up in trouble...because DS also has a HLL implementation...wouldn't that conflict with the existing one?

note: PR-1824 named the file VectorUDAFComputeBitVector.txt and internally named the method compute_bit_vector_hll ; I think the class name should have contained the Hll keyword

I think that the file name VectorUDAFComputeKLL.txt is not connected at all to the ds_kll_sketch function its about to vectorize...and as such its a bit confusing....

The current implementation doesn't really look forward: I think we have 20 sketch function from datasketches already exposed as inside Hive which could be vectorized; I think they are behind the same api cover...so just vectorizing the KLL one without any sight forward and taking "ideas" from the old hll codepath doesn't seem the best idea to me...

grep ^ds_ ql/src/test/results/clientpositive/llap/show_functions.q.out|grep _sketch$

no need to do everything in 1 patch - but this is pretty much just copy-pasting the existing hll txtfile substituted to kll here and there...so we should do that 20 times?

For instance, you seem to be suggesting to remove all helper classes/methods etc I don't think those changes neccessary in the metastore for a vectorization of this function?

HIVE-26221 is something which have changes - but has no real end-user accessible value - and as such I don't think its ready.

kgyrtkirk avatar May 31 '22 13:05 kgyrtkirk

PR-1824 has nothing to do with datasketches; I don't know how you followed it's conventions but you might end up in trouble...because DS also has a HLL implementation...wouldn't that conflict with the existing one?

HyperLogLog is one of the few examples of vectorized functions over a "complex" data type, it seemed like a good candidate to me.

note: PR-1824 named the file VectorUDAFComputeBitVector.txt and internally named the method compute_bit_vector_hll ; I think the class name should have contained the Hll keyword

I think that the file name VectorUDAFComputeKLL.txt is not connected at all to the ds_kll_sketch function its about to vectorize...and as such its a bit confusing....

I agree on that, maybe VectorUDAF_ds_kll_sketch would be clearer.

The current implementation doesn't really look forward: I think we have 20 sketch function from datasketches already exposed as inside Hive which could be vectorized; I think they are behind the same api cover...so just vectorizing the KLL one without any sight forward and taking "ideas" from the old hll codepath doesn't seem the best idea to me...

grep ^ds_ ql/src/test/results/clientpositive/llap/show_functions.q.out|grep _sketch$

no need to do everything in 1 patch - but this is pretty much just copy-pasting the existing hll txtfile substituted to kll here and there...so we should do that 20 times?

Sketches are behind the same api for methods like update() and serialize(), but they also have differences, like the supported constructors, it seems hard to have a uniform UDAF for all of them. I have omitted the constructor for supporting the K param of Kll sketches in the current PR, but it will be needed later on (I have implemented it as suggested here: https://github.com/apache/hive/pull/1824#issuecomment-756588127), so we will have to have a specialized version of the KLL UDAF anyway, and the same for all the other 9 sketches, I don't see a way out of this.

For instance, you seem to be suggesting to remove all helper classes/methods etc I don't think those changes neccessary in the metastore for a vectorization of this function?

HIVE-26221 is something which have changes - but has no real end-user accessible value - and as such I don't think its ready.

HIVE-26221 already sensibly improves range filter estimation for runtime statistics that can be used by Tez, of course exploiting statistics in query planning is the next obvious step, but I am a fan of iterative improvements, especially which such a big patch spanning few modules. Let's have this discussion offline as it is not key here, I just wanted to highlight the needs behind the current patch by linking to this ticket.

I am trying to figure out a better place to put an interface plus the concrete implementation(s) of the "complex" object incapsulating the different sketches: the only examples I could find are HyperLogLog (stored in the metastore-server module) and BloomFilter/BloomKFilter (stored in the storage-api module). As of now I have followed the HyperLogLog example but you seem to disagree, storage-api seems pretty arbitrary too, so I am not sure how to proceed here, do you have any suggestions?

asolimando avatar Jun 07 '22 11:06 asolimando

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Feel free to reach out on the [email protected] list if the patch is in need of reviews.

github-actions[bot] avatar Aug 07 '22 00:08 github-actions[bot]

Please keep the PR open

asolimando avatar Aug 07 '22 17:08 asolimando

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Feel free to reach out on the [email protected] list if the patch is in need of reviews.

github-actions[bot] avatar Oct 08 '22 00:10 github-actions[bot]

Please keep it open

asolimando avatar Oct 08 '22 06:10 asolimando

Rebased on master in order to take advantage of PR-3637 to support KLL sketch size parameter in the constructor (otherwise we will miss vectorization opportunities when the UDAF is used in its version specifying the sketch size). Commits implementing this will come shortly.

asolimando avatar Oct 26 '22 14:10 asolimando

Rebasing on master

asolimando avatar Nov 19 '22 10:11 asolimando

Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 3 Code Smells

No Coverage information No Coverage information
No Duplication information No Duplication information

sonarqubecloud[bot] avatar Nov 19 '22 10:11 sonarqubecloud[bot]

@deniskuzZ tests are green, ready to be merged when you have a moment, thanks!

asolimando avatar Nov 19 '22 17:11 asolimando