GH-17211: [C++] Add `hash32` and `hash64` scalar compute functions
Rationale for this change
Support for calculating elementwise hashes.
This PR adds two scalar functions, `hash32()` and `hash64()`, built on the existing internal hashing machinery.
What changes are included in this PR?
Continuation of https://github.com/apache/arrow/pull/39836 with the following changes:
- Use a column-oriented hash-combine rather than flattening nested elements
- Support arbitrary nesting levels, with an optimization that only recurses into child arrays when they are themselves nested (a sketch of this approach follows below)
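To make the approach concrete, here is a minimal Python sketch of the column-oriented idea. The helper names (`hash_leaf`, `combine`, `hash_column`) are hypothetical; the real kernels are implemented in C++ on top of Arrow's internal Hashing32/Hashing64 machinery and do not use this code.

```python
def hash_leaf(values):
    # Stand-in for an elementwise primitive hash; NULLs hash to a fixed value.
    return [0 if v is None else hash(v) & 0xFFFFFFFFFFFFFFFF for v in values]

def combine(h1, h2):
    # Stand-in elementwise hash-combine of two per-row hash columns.
    return [(a * 31 + b) & 0xFFFFFFFFFFFFFFFF for a, b in zip(h1, h2)]

def hash_column(column):
    # A column is either a list of leaf values or a dict of child columns
    # (a struct). Child columns are hashed independently and their per-row
    # hashes are combined; recursion into grandchildren only happens when a
    # child is itself nested.
    if isinstance(column, dict):
        child_hashes = [hash_column(child) for child in column.values()]
        result = child_hashes[0]
        for child in child_hashes[1:]:
            result = combine(result, child)
        return result
    return hash_leaf(column)

# Two rows of struct<a: int64, b: string>, stored column by column.
print(hash_column({"a": [1, 2], "b": ["x", "y"]}))
```

The point of the column-oriented form is that per-child hashes can be computed and combined without ever materializing flattened row values for nested types.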
Are these changes tested?
Partially; a proper testing suite is still being worked on.
Are there any user-facing changes?
There is a new compute kernel, `hash_64`, available.
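For a sense of the user-facing API, here is a sketch of how the kernel is called from Python. The flat-array calls mirror the examples later in this thread; the struct example assumes nested-type support as described above, and any output values are illustrative only.

```python
import pyarrow as pa
import pyarrow.compute as pc

# Elementwise 64-bit hashes of a flat array.
print(pc.hash_64(pa.array([0, 1, None])))

# Nested input: per-child hashes are combined column by column
# (assuming struct support lands as described in this PR).
structs = pa.array(
    [{"a": 1, "b": "x"}, {"a": 2, "b": "y"}],
    type=pa.struct([("a", pa.int64()), ("b", pa.string())]),
)
print(pc.hash_64(structs))
```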
- GitHub Issue: #17211
It seems like we generate the same hash for both `NULL` and `0`, which is not ideal.
```python
In [1]: import pyarrow as pa

In [2]: import pyarrow.compute as pc

In [3]: pc.hash_64([None])
Out[3]:
<pyarrow.lib.UInt64Array object at 0x124247be0>
[
  0
]

In [4]: pc.hash_64([0])
Out[4]:
<pyarrow.lib.UInt64Array object at 0x1033027a0>
[
  0
]
```
@pitrou NULLs are explicitly hashed as 0, but `0::int` also hashes to 0 due to https://github.com/apache/arrow/blob/main/cpp/src/arrow/compute/key_hash_internal.cc#L304. Any idea how we could generate distinct hashes for NULL and 0 without a performance regression?
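For context, here is a rough Python rendering of the current multiply-and-byteswap step; the multiplier is taken from the linked code, but the Python itself is illustrative, not the actual implementation. It shows why a key of 0 is forced to hash to 0.

```python
MULTIPLIER = 11400714785074694791  # kMultiplier in key_hash_internal.cc

def hash_int32(key: int) -> int:
    # uint64_t product, 64-bit byteswap, then truncation to uint32_t.
    product = (key * MULTIPLIER) & 0xFFFFFFFFFFFFFFFF
    swapped = int.from_bytes(product.to_bytes(8, "little"), "big")  # BYTESWAP
    return swapped & 0xFFFFFFFF

print(hash_int32(0))  # 0: the product is 0, so the hash is 0 and collides with the NULL hash
print(hash_int32(1))  # nonzero
```

One option, discussed below, is to offset the key by a constant before the multiply, which removes the forced collision at the cost of a single extra addition per key.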
Well, you could just add a constant, e.g.:
```cpp
template <bool T_COMBINE_HASHES, typename T>
void Hashing32::HashIntImp(uint32_t num_keys, const T* keys, uint32_t* hashes) {
  constexpr uint64_t kMultiplier = 11400714785074694791ULL;
  // Added constant: offsets the key before the multiply so that a key of 0
  // no longer hashes to 0 (which collides with the fixed NULL hash).
  constexpr uint64_t kAddend = 9756277977048271785ULL;
  for (uint32_t ikey = 0; ikey < num_keys; ++ikey) {
    uint64_t x = static_cast<uint64_t>(keys[ikey]) + kAddend;
    uint32_t hash = static_cast<uint32_t>(BYTESWAP(x * kMultiplier));
    if (T_COMBINE_HASHES) {
      hashes[ikey] = CombineHashesImp(hashes[ikey], hash);
    } else {
      hashes[ikey] = hash;
    }
  }
}
```
Thanks, that was my idea as well; I just didn't want to add an extra operation.
Just asking - any idea if there is line of sight to this issue being closed? I am interested because, among other things, I believe this Arrow functionality is the underlying reason why pandas cannot currently group by a pyarrow-backed struct series.
The PR should be ready now. Do you have a reference to the relevant pandas issue perhaps?
@kszucs I couldn’t find an existing pandas issue so I created one, see https://github.com/pandas-dev/pandas/issues/60594#issuecomment-2557730650
@zanmato1984 @pitrou If you have any time, could you review this? The relevant issue (#17211) got flagged as stale, but I think this is still desirable.
@kszucs Could you rebase on main?
@kou sure, rebased.
@kou let me know if there is something missing. We can add support for the missing types in follow-ups.
Sorry for not noticing the review ping; unfortunately this missed milestone 21.0. I'll review again shortly.