
ARROW-8991: [C++] Add hash_64 scalar compute function

Open drin opened this issue 3 years ago • 44 comments

We would like to expose hashing functions via new compute functions. This PR would add a scalar compute function, FastHash32, which uses Hashing32::HashMultiColumn to compute a 32-bit hash for each row of the input array (or scalar).

This PR focuses on 32-bit hashing, but I would like to use it to help decide whether a different design is preferable before adding 64-bit hashing. I am also not sure what the unit tests should contain beyond code coverage.

Potential design changes:

  • Using an Options argument.
  • Using an init function for the kernel (but I don't know the input length until run time).
  • Changing how hashing is used.

Confusion about unit tests:

  • I'm not sure how to validate that the hash outputs are "as expected". Further, the existing unit tests for the hashing functions don't seem to validate hash outputs.
  • I'm not sure whether validating the output data types is necessary.
  • Code coverage of the various input types seems to be the most important, but I need help figuring out the best way to approach this.

Edit (8/9/22):

This PR has been open for a while but seems to have hit a good level of functionality. This PR does not:

  • address whether the hash functions are "acceptable" statistically
  • comprehensively benchmark, or test, the various data types

I think these can be addressed in future improvements.

drin avatar Jun 30 '22 21:06 drin

https://issues.apache.org/jira/browse/ARROW-8991

github-actions[bot] avatar Jun 30 '22 21:06 github-actions[bot]

I'm not sure how to validate the hash outputs are "as expected". Further, unit tests for the hashing functions don't seem to validate hash outputs.

For a hashing function I would expect:

  • If two values are equal then their hashes are equal
  • Given a random selection of non-equal values there should be some kind of expected false positive rate (i.e. equal hashes on unequal values). Ideally we would include, as part of this, a benchmark that measures the FPR on random values. You could then pick a safe threshold (e.g. if the benchmark tends to show a 5% FPR then pick 10%) and put that into the unit test (e.g. assert the FPR is less than the safe threshold).

westonpace avatar Jun 30 '22 21:06 westonpace

https://issues.apache.org/jira/browse/ARROW-16945

github-actions[bot] avatar Jun 30 '22 21:06 github-actions[bot]

:warning: Ticket has not been started in JIRA, please click 'Start Progress'.

github-actions[bot] avatar Jun 30 '22 21:06 github-actions[bot]

@save-buffer would you mind reviewing this when you get a chance?

westonpace avatar Jun 30 '22 21:06 westonpace

I renamed this PR for ARROW-16945, so that I can track just a 32-bit version for now. I'm not sure whether this will confuse some of the automated issue tracking/associations.

drin avatar Jun 30 '22 21:06 drin

  • Given a random selection of non-equal values

I'd add that non-random but likely selections should also show a nice hash distribution. Including:

  • ranges of consecutive [0, N] numbers
  • clustered strings (for example a subset of /usr/share/dict/words, or something generated that looks like that)

Also, ideally, the bottom K bits of the hash should also keep those nice statistical properties (so that a hash table implementation can simply use a power-of-two modulo).

The implementations from arrow/util/hashing.h try to satisfy those requirements, btw, so you could reuse them.

pitrou avatar Jul 19 '22 09:07 pitrou

Why 32-bit rather than 64-bit?

pitrou avatar Jul 19 '22 09:07 pitrou

Why 32-bit rather than 64-bit?

I think Weston mentioned 32-bit in passing and I just made a semi-arbitrary decision to start there. I plan to follow this up with a 64-bit version, though I'm not sure whether other things (memory management, etc.) should be addressed before I do so.

drin avatar Jul 19 '22 17:07 drin

I'd add that non-random but likely selections should also show a nice hash distribution. [...] The implementations from arrow/util/hashing.h try to satisfy those requirements, btw, so you could reuse them.

It would be great, by the way, to have benchmarks of the two hashing utilities we have. I believe the utilities we have in key_hash.h trade off some distribution performance in favor of runtime performance when compared with the utilities in hashing.h. It would be nice to have some objective measures of this tradeoff.

Also, can we expand in the function doc that these hashes are not suitable for cryptographic purposes?

westonpace avatar Jul 19 '22 22:07 westonpace

There's https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/hashing_benchmark.cc at least

pitrou avatar Jul 19 '22 22:07 pitrou

Thanks @pitrou, I'll start with that as a base and add benchmarks that use the utilities from key_hash.h (I'll check whether such benchmarks already exist elsewhere).

I was recently exposed to hashing.h when looking at the count_distinct function, so I should be able to compare them. Maybe I'll sketch out another compute function, StandardHash (an arbitrary, temporary name to distinguish it from FastHash), and we can use it to better consider how many hash functions we should have vs. what should be knobs available via an Options class.

drin avatar Jul 19 '22 23:07 drin

rebased

drin avatar Jul 20 '22 03:07 drin

I wanted to update with initial changes to hashing_benchmark.cc. To make sure the comparisons covered at least a few cases, I decided to add FastHash64, although it is currently largely copy-pasted.

Barring code style/organization improvements, I was hoping to get some feedback on whether these benchmarks are fundamentally what we would like to see. Some questions I have:

  • how do I ensure I am not including the time to construct an arrow::Array from a vector in the benchmark time?
  • do we want to time calling the function, or just the mechanisms that the compute function wraps?
    • do we want to do both to also capture any overhead in compute kernel invocation?

drin avatar Jul 20 '22 18:07 drin

Also, I see now that util/hashing.h wraps xxHash, vendored from Cyan4973's github repo. Hashing32 has the following comment above it:

// Implementations are based on xxh3 32-bit algorithm description from:
// https://github.com/Cyan4973/xxHash/blob/dev/doc/xxhash_spec.md

I need to experiment a bit, but I suspect that for int primitives, Hashing32 and Hashing64 just wrap xxHash and add logic for iterating over rows and columns. Specifically, key_hash.cc#L745 looks a lot like hashing.h#L88. hashing_benchmark.cc#L70 benchmarks using 2 multipliers, and I think if it uses just the 1 multiplier then the implementation looks a lot like Hashing64.

All this to say: I think this overlap should be accounted for in the benchmarks, and I also wonder whether some of the hashing code could be simplified or better organized, since the implementations aren't as independent as they seem at first glance.

cc @michalursa

drin avatar Jul 20 '22 18:07 drin

Here is my understanding, which may not be complete:

util/hashing.h is a more-or-less direct adapter between Arrow and xxhash.

key_hash.cc is based on xxhash but most of the algorithms have been reimplemented to some degree. The key_hash.cc utilities are, in theory, better suited to take advantage of columnar formats and/or vectorized CPUs. However, as part of this rewrite, certain sacrifices were made in hash quality, in favor of performance. This is primarily (only?) used today in Acero for the hash-join and hash-aggregate.

In theory, util/hashing.cc should have a better distribution but worse performance than key_hash.cc.

westonpace avatar Jul 20 '22 19:07 westonpace

Weston is correct, the key_hash interface is only meant for internal use of hash tables. It subtly doesn't match xxh in some cases (iirc it for example can pad with 0's instead of having special handling for arbitrary lengths). The interface is also optimized for hashing in mini batches and reusing as much memory as possible. In general I would suggest using util/hashing instead of key_hash, as key_hash is for internal use; we may want to change it at any time (e.g. we had an idea to see if we can make a really crappy hash that's very fast to compute but is "good enough" for a hash table).

I can leave a review of the current code, but I'd suggest moving away from the key_hash interface overall.

save-buffer avatar Jul 20 '22 19:07 save-buffer

Biggest feedback is to not allocate a large temporary std::vector and instead write directly into the buffer you'll be returning. Also you made your TempVectorStack way too big.

thanks! I had trouble with smaller sizes and tried not to size it too large, but I'll revisit.

For writing directly into the buffer to return, I assume that means I need to create an ArrayData? Is there a better interface via Builders?

drin avatar Jul 20 '22 21:07 drin

never mind, you actually mentioned how to do it already; the mobile interface hid it.

drin avatar Jul 20 '22 21:07 drin

util/hashing.h is a more-or-less direct adapter between Arrow and xxhash.

That's how it ended up working in practice for non-tiny strings, but the basic objective is to have some fast though statistically good hash functions.

pitrou avatar Jul 21 '22 07:07 pitrou

@drin could you please add entries for these functions in the Compute Functions docs (cpp/compute.rst)? Thanks!

ianmcook avatar Jul 21 '22 22:07 ianmcook

@ursabot please benchmark lang=C++

drin avatar Jul 22 '22 00:07 drin

Benchmark runs are scheduled for baseline = 4d931ff1c0f5661a9b134c514555c8d769001759 and contender = 3c1fd3b03fa2b143d582244ca1bb93fbc0c84bcf. Results will be available as each benchmark for each run completes.

Conbench compare runs links:

  • [Skipped :warning: Only ['Python'] langs are supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
  • [Failed :arrow_down:0.69% :arrow_up:0.17%] test-mac-arm
  • [Skipped :warning: Only ['JavaScript', 'Python', 'R'] langs are supported on ursa-i9-9960x] ursa-i9-9960x
  • [Finished :arrow_down:0.46% :arrow_up:0.0%] ursa-thinkcentre-m75q

Buildkite builds:

  • [Finished] 3c1fd3b0 test-mac-arm
  • [Finished] 3c1fd3b0 ursa-thinkcentre-m75q
  • [Failed] 4d931ff1 ec2-t3-xlarge-us-east-2
  • [Failed] 4d931ff1 test-mac-arm
  • [Failed] 4d931ff1 ursa-i9-9960x
  • [Finished] 4d931ff1 ursa-thinkcentre-m75q

Supported benchmarks:

  • ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
  • test-mac-arm: Supported benchmark langs: C++, Python, R
  • ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
  • ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

ursabot avatar Jul 22 '22 00:07 ursabot

Before figuring out the Conbench stuff, I put some of the local benchmark results in a Google Slides deck:

https://docs.google.com/presentation/d/1cUU_F3jB6LsOLbClhl34YdQiodtbz7l76l3juHTsC5k/edit#slide=id.g13e9d117f47_0_63

Just wanted to put this in as a sneak peek even though the test size is small (10,000 items) and this is only for a couple primitives.

drin avatar Jul 22 '22 00:07 drin

@ursabot please benchmark lang=C++

drin avatar Jul 22 '22 16:07 drin

Commit 3c1fd3b03fa2b143d582244ca1bb93fbc0c84bcf already has scheduled benchmark runs.

ursabot avatar Jul 22 '22 16:07 ursabot

Sorry, renaming the PR back to ARROW-8991. Since I am already trying to accommodate 64-bit, it didn't make sense to keep the smaller scope that the sub-task was meant to capture. I'll try not to scope-creep too much, though.

drin avatar Aug 02 '22 06:08 drin

This PR is ready for review. Given how long it has been open, I tried to get the important coverage through util/hashing_benchmark.cc and handle the main data types.

This PR does not:

  • address whether the hash functions are "acceptable" statistically
  • comprehensively benchmark, or test, the various data types

I think this PR is mostly a first pass at creating some hash functions that can be called through the compute API.

I will create a few issues related to ARROW-8991 to track future improvements.

drin avatar Aug 09 '22 21:08 drin

@ursabot please benchmark lang=C++

drin avatar Aug 10 '22 21:08 drin

Benchmark runs are scheduled for baseline = 4d931ff1c0f5661a9b134c514555c8d769001759 and contender = 737387bf983bdf33a567d042b460c213cfdc03c6. Results will be available as each benchmark for each run completes.

Conbench compare runs links:

  • [Skipped :warning: Only ['Python'] langs are supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
  • [Failed :arrow_down:0.48% :arrow_up:2.04%] test-mac-arm
  • [Skipped :warning: Only ['JavaScript', 'Python', 'R'] langs are supported on ursa-i9-9960x] ursa-i9-9960x
  • [Finished :arrow_down:2.7% :arrow_up:1.17% :warning: Contender and baseline run contexts do not match] ursa-thinkcentre-m75q

Buildkite builds:

  • [Finished] 737387bf test-mac-arm
  • [Finished] 737387bf ursa-thinkcentre-m75q
  • [Failed] 4d931ff1 ec2-t3-xlarge-us-east-2
  • [Failed] 4d931ff1 test-mac-arm
  • [Failed] 4d931ff1 ursa-i9-9960x
  • [Finished] 4d931ff1 ursa-thinkcentre-m75q

Supported benchmarks:

  • ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
  • test-mac-arm: Supported benchmark langs: C++, Python, R
  • ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
  • ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

ursabot avatar Aug 10 '22 21:08 ursabot