druid
druid copied to clipboard
Add "stringEncoding" parameter to DataSketches HLL.
Builds on the concept from #11172 and adds a way to feed HLL sketches with UTF-8 bytes.
This must be an option rather than always-on, because prior to this patch, HLL sketches used UTF-16LE encoding when hashing strings. To remain compatible with sketch images created prior to this patch -- which matters during rolling updates and when reading sketches that have been written to segments -- we must keep UTF-16LE as the default.
Not currently documented, because I'm not yet sure how best to expose this functionality to users. I think the first place would be in the SQL layer: we could have it automatically select UTF-8 or UTF-16LE when building sketches at query time. We need to be careful about this, though, because UTF-8 isn't always faster. Sometimes, like for the results of expressions, UTF-16LE is faster. I expect we will sort this out in future patches.
Benchmarks
These are benchmarks of APPROX_COUNT_DISTINCT_DS_HLL(<string column>)
on a 500k row segment. The values that appear are generally short (a few characters long).
patch - utf8
Benchmark (query) (rowsPerSegment) (vectorize) Mode Cnt Score Error Units
SqlBenchmark.querySql 1 500000 false avgt 15 27.639 ± 1.221 ms/op
SqlBenchmark.querySql 1 500000 force avgt 15 25.362 ± 2.406 ms/op
patch - utf16le
Benchmark (query) (rowsPerSegment) (vectorize) Mode Cnt Score Error Units
SqlBenchmark.querySql 1 500000 false avgt 15 40.828 ± 2.855 ms/op
SqlBenchmark.querySql 1 500000 force avgt 15 39.257 ± 5.548 ms/op
master
Benchmark (query) (rowsPerSegment) (vectorize) Mode Cnt Score Error Units
SqlBenchmark.querySql 1 500000 false avgt 15 43.704 ± 3.875 ms/op
SqlBenchmark.querySql 1 500000 force avgt 15 45.293 ± 3.045 ms/op
Testing strategy
- Extended HllSketchAggregatorTest to run on a matrix of (vectorized, nonvectorized) x (utf8, utf16le).
- Added specific sketch image expectations to HllSketchAggregatorTest and manually verified them against master, to ensure consistency across versions.
- Added serde tests to make sure the new field JSON-round-trips properly.
This pull request introduces 1 alert when merging e27cc1cb3de02dec5dadec0e398b445a7132045f into 99f39c720261923a4f908ad2ae33b03b372989b0 - view on LGTM.com
new alerts:
- 1 for Inconsistent equals and hashCode
This pull request introduces 1 alert when merging 492edd34f67378656d64c260e1b46cb8a320ce29 into 34169c8550dbe0a3fc9043aa21d221b9c7502039 - view on LGTM.com
new alerts:
- 1 for Inconsistent equals and hashCode
This pull request introduces 1 alert when merging 0d7d9aad781528b09013c91bc95699f011f314a7 into 34169c8550dbe0a3fc9043aa21d221b9c7502039 - view on LGTM.com
new alerts:
- 1 for Inconsistent equals and hashCode
@gianm I want to clarify a comment made above.
HLL sketches used UTF-16LE encoding when hashing strings.
This is not correct, at least for the HLL in datasketches-java (I'm not sure what the Druid adaptor does). Strings are encoded using UTF-8 and have been for as long as I can remember. If you wish to use UTF-16, you just convert your string to char[] and the HLL sketch will accept that as well. The sketch really doesn't care what the string encoding is, it is either looking at the input as a stream of byte[] or char[]. The UTF-8 encoding was specified in the string update method to help users ensure consistency (if the string happened to be encoded in something else). Nonetheless, whatever you decide, you will always need to stick with your choice. Otherwise, you will destroy the unique identity of whatever you are feeding the sketch. As a result counts, merging, etc will be meaningless!
I have some comments about PR 353 but I want to make these in the actual PR.
Lee.
This is not correct, at least for the HLL in datasketches-java (I'm not sure what the Druid adaptor does). Strings are encoded using UTF-8 and have been for as long as I can remember. If you wish to use UTF-16, you just convert your string to char[] and the HLL sketch will accept that as well.
@leerho Understood, but it is true as far as Druid is concerned — the HllSketch-based aggregator implementation in Druid does update(s.toCharArray())
not update(s)
: https://github.com/apache/druid/blob/8296123d895db7d06bc4517db5e767afb7862b83/extensions-core/datasketches/src/main/java/org/apache/druid/query/aggregation/datasketches/hll/HllSketchBuildAggregator.java#L103
Nonetheless, whatever you decide, you will always need to stick with your choice.
Yep, that's why this must be an option and the choice needs to be made in a consistent way.
I have some comments about PR 353 but I want to make these in the actual PR.
Thanks!
Pushed this stuff:
- fixes for the issues @abhishekagarwal87 found
- update to use the new API from https://github.com/apache/datasketches-java/pull/353
- improve test coverage for updateSketch and updateSketchWithDictionarySelector
This pull request introduces 1 alert when merging 53a708fa3f29d517260f954c87288da934dd9222 into 44a7b091908f8602d28ce4bffe791764e0ee57ad - view on LGTM.com
new alerts:
- 1 for Inconsistent equals and hashCode
This pull request introduces 1 alert when merging f7ff8d4c9e80492bf7b43ecc6cbd15da5cd1efbd into 44a7b091908f8602d28ce4bffe791764e0ee57ad - view on LGTM.com
new alerts:
- 1 for Inconsistent equals and hashCode
This pull request introduces 1 alert when merging dc373eb9f69c5d985a802d0e7f22f91fd895876c into 223c5692a8292a210a770dc0702f75b4484ae347 - view on LGTM.com
new alerts:
- 1 for Inconsistent equals and hashCode
This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If you think that's incorrect or this pull request should instead be reviewed, please simply write any comment. Even if closed, you can still revive the PR at any time or discuss it on the [email protected] list. Thank you for your contributions.
This pull request/issue is no longer marked as stale.
Resolved conflicts with master.
This pull request introduces 1 alert when merging dacd05e0b4028d1c0aba8b3d721094311183b0b7 into 7ab21708021cb534ad34608b3fb06cb0273c2cd2 - view on LGTM.com
new alerts:
- 1 for Inconsistent equals and hashCode
All tests passed except standard-its / (Compile=openjdk8, Run=openjdk8, Cluster Build On K8s) ITNestedQueryPushDownTest integration test
, which has been weird in other PRs as well, and I don't believe would be related to the code in this patch. So, I'll merge it.