hive
hive copied to clipboard
HIVE-26243 ds kll sketch vectorized
What changes were proposed in this pull request?
We add a vectorized implementation for the ds_kll_sketch
UDAF
Why are the changes needed?
When this UDAF is used either alone or at the side of other vectorizable functions, it will benefit from a performance speed-up.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
mvn test -Dtest=TestMiniLlapLocalCliDriver -Dqfile="compute_kll_sketch.q" -Dtest.output.overwrite -pl itests/qtest -Pitests
and
mvn test -Dtest=TestMiniLlapLocalCliDriver -Dqfile="sketches_rewrite_rank_partition_by.q,sketches_rewrite_rank.q,sketches_rewrite_percentile_disc.q,sketches_rewrite_ntile_partition_by.q,sketches_rewrite_ntile.q,sketches_rewrite_cume_dist_partition_by.q,sketches_rewrite_cume_dist.q,sketches_materialized_view_rank.q,sketches_materialized_view_percentile_disc.q,sketches_materialized_view_ntile.q,sketches_materialized_view_cume_dist.q" -Dtest.output.overwrite -pl itests/qtest -Pitests
Before going into the single discussions, the general answer to all the above comments boils down to "I am trying to keep consistency with what was done here for vectorizing HyperLogLog function": https://github.com/apache/hive/pull/1824/files
I sense that you don't like how that PR was designed, but since they are very close in spirit, and that their code is used side by side, I thought it was important to keep them consistent.
If we need to rework the current PR, they won't match anymore, unless we rework the HLL design and implementation too, and this has its own share of cons...
Assuming we go for the refactoring, most of the comments are too sketchy to give appropriate guidance over an alternative design/implementation, I will need to ask you to elaborate more on them.
For instance, you seem to be suggesting to remove all helper classes/methods etc. Since it does not seem feasible to inline all the code now sitting in the helper methods/classes directly in the vectorized implementation, I guess you want to place it someplace else, but I can't really decide based on your comment.
For the couple of currently unused methods, I will need them in a PR depending on this one: https://issues.apache.org/jira/browse/HIVE-26221: I can remove them now and re-introduce them later, if preferable. Once again they mimic HLL methods (both naming and usage, since HLL and KLL methods will be used side by side in most places, it helps reading what's happening, see LongColumnStatsAggregator.java#L104-L111, for instance).
PR-1824 has nothing to do with datasketches; I don't know how you followed it's conventions but you might end up in trouble...because DS also has a HLL implementation...wouldn't that conflict with the existing one?
note: PR-1824 named the file VectorUDAFComputeBitVector.txt
and internally named the method compute_bit_vector_hll
; I think the class name should have contained the Hll
keyword
I think that the file name VectorUDAFComputeKLL.txt
is not connected at all to the ds_kll_sketch
function its about to vectorize...and as such its a bit confusing....
The current implementation doesn't really look forward: I think we have 20 sketch function from datasketches already exposed as inside Hive which could be vectorized; I think they are behind the same api cover...so just vectorizing the KLL one without any sight forward and taking "ideas" from the old hll codepath doesn't seem the best idea to me...
grep ^ds_ ql/src/test/results/clientpositive/llap/show_functions.q.out|grep _sketch$
no need to do everything in 1 patch - but this is pretty much just copy-pasting the existing hll txtfile substituted to kll here and there...so we should do that 20 times?
For instance, you seem to be suggesting to remove all helper classes/methods etc I don't think those changes neccessary in the metastore for a vectorization of this function?
HIVE-26221 is something which have changes - but has no real end-user accessible value - and as such I don't think its ready.
PR-1824 has nothing to do with datasketches; I don't know how you followed it's conventions but you might end up in trouble...because DS also has a HLL implementation...wouldn't that conflict with the existing one?
HyperLogLog is one of the few examples of vectorized functions over a "complex" data type, it seemed like a good candidate to me.
note: PR-1824 named the file
VectorUDAFComputeBitVector.txt
and internally named the methodcompute_bit_vector_hll
; I think the class name should have contained theHll
keywordI think that the file name
VectorUDAFComputeKLL.txt
is not connected at all to theds_kll_sketch
function its about to vectorize...and as such its a bit confusing....
I agree on that, maybe VectorUDAF_ds_kll_sketch
would be clearer.
The current implementation doesn't really look forward: I think we have 20 sketch function from datasketches already exposed as inside Hive which could be vectorized; I think they are behind the same api cover...so just vectorizing the KLL one without any sight forward and taking "ideas" from the old hll codepath doesn't seem the best idea to me...
grep ^ds_ ql/src/test/results/clientpositive/llap/show_functions.q.out|grep _sketch$
no need to do everything in 1 patch - but this is pretty much just copy-pasting the existing hll txtfile substituted to kll here and there...so we should do that 20 times?
Sketches are behind the same api for methods like update()
and serialize()
, but they also have differences, like the supported constructors, it seems hard to have a uniform UDAF for all of them. I have omitted the constructor for supporting the K param of Kll sketches in the current PR, but it will be needed later on (I have implemented it as suggested here: https://github.com/apache/hive/pull/1824#issuecomment-756588127), so we will have to have a specialized version of the KLL UDAF anyway, and the same for all the other 9 sketches, I don't see a way out of this.
For instance, you seem to be suggesting to remove all helper classes/methods etc I don't think those changes neccessary in the metastore for a vectorization of this function?
HIVE-26221 is something which have changes - but has no real end-user accessible value - and as such I don't think its ready.
HIVE-26221 already sensibly improves range filter estimation for runtime statistics that can be used by Tez, of course exploiting statistics in query planning is the next obvious step, but I am a fan of iterative improvements, especially which such a big patch spanning few modules. Let's have this discussion offline as it is not key here, I just wanted to highlight the needs behind the current patch by linking to this ticket.
I am trying to figure out a better place to put an interface plus the concrete implementation(s) of the "complex" object incapsulating the different sketches: the only examples I could find are HyperLogLog
(stored in the metastore-server
module) and BloomFilter
/BloomKFilter
(stored in the storage-api
module). As of now I have followed the HyperLogLog
example but you seem to disagree, storage-api
seems pretty arbitrary too, so I am not sure how to proceed here, do you have any suggestions?
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Feel free to reach out on the [email protected] list if the patch is in need of reviews.
Please keep the PR open
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Feel free to reach out on the [email protected] list if the patch is in need of reviews.
Please keep it open
Rebased on master in order to take advantage of PR-3637 to support KLL sketch size parameter in the constructor (otherwise we will miss vectorization opportunities when the UDAF is used in its version specifying the sketch size). Commits implementing this will come shortly.
Rebasing on master
@deniskuzZ tests are green, ready to be merged when you have a moment, thanks!