GH-17211: [C++] Add `hash32` and `hash64` scalar compute functions
Rationale for this change
Support for calculating elementwise hashes.
This PR adds two scalar functions, `hash32()` and `hash64()`, built on the existing internal hashing machinery.
What changes are included in this PR?
Continuation of https://github.com/apache/arrow/pull/39836 with the following changes:
- Use a column-oriented hash-combine rather than flattening nested elements
- Support arbitrary nesting levels, with an optimization that only recurses into child arrays when they are themselves nested (a sketch of this approach follows below)
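To make the approach concrete, here is a minimal Python sketch of the column-oriented idea. The helper names (`hash_leaf`, `combine`, `hash_column`) are hypothetical; the real kernels are implemented in C++ on top of Arrow's internal Hashing32/Hashing64 machinery and do not use this code.

```python
def hash_leaf(values):
    # Stand-in for an elementwise primitive hash; NULLs hash to a fixed value.
    return [0 if v is None else hash(v) & 0xFFFFFFFFFFFFFFFF for v in values]

def combine(h1, h2):
    # Stand-in elementwise hash-combine of two per-row hash columns.
    return [(a * 31 + b) & 0xFFFFFFFFFFFFFFFF for a, b in zip(h1, h2)]

def hash_column(column):
    # A column is either a list of leaf values or a dict of child columns
    # (a struct). Child columns are hashed independently and their per-row
    # hashes are combined; recursion into grandchildren only happens when a
    # child is itself nested.
    if isinstance(column, dict):
        child_hashes = [hash_column(child) for child in column.values()]
        result = child_hashes[0]
        for child in child_hashes[1:]:
            result = combine(result, child)
        return result
    return hash_leaf(column)

# Two rows of struct<a: int64, b: string>, stored column by column.
print(hash_column({"a": [1, 2], "b": ["x", "y"]}))
```

The point of the column-oriented form is that per-child hashes can be computed and combined without ever materializing flattened row values for nested types.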
Are these changes tested?
Partially; a proper testing suite is still being worked on.
Are there any user-facing changes?
There is a new compute kernel, `hash_64`, available.
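For a sense of the user-facing API, here is a sketch of how the kernel is called from Python. The flat-array calls mirror the examples later in this thread; the struct example assumes nested-type support as described above, and any output values are illustrative only.

```python
import pyarrow as pa
import pyarrow.compute as pc

# Elementwise 64-bit hashes of a flat array.
print(pc.hash_64(pa.array([0, 1, None])))

# Nested input: per-child hashes are combined column by column
# (assuming struct support lands as described in this PR).
structs = pa.array(
    [{"a": 1, "b": "x"}, {"a": 2, "b": "y"}],
    type=pa.struct([("a", pa.int64()), ("b", pa.string())]),
)
print(pc.hash_64(structs))
```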
- GitHub Issue: #17211
It seems like we generate the same hash for both `NULL` and `0`, which is not ideal.
```python
In [1]: import pyarrow as pa

In [2]: import pyarrow.compute as pc

In [3]: pc.hash_64([None])
Out[3]:
<pyarrow.lib.UInt64Array object at 0x124247be0>
[
  0
]

In [4]: pc.hash_64([0])
Out[4]:
<pyarrow.lib.UInt64Array object at 0x1033027a0>
[
  0
]
```
@pitrou NULLs are explicitly hashed as 0, but `0::int` also hashes to 0 due to https://github.com/apache/arrow/blob/main/cpp/src/arrow/compute/key_hash_internal.cc#L304. Any idea how we could generate distinct hashes for NULL and 0 without a performance regression?
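For context, here is a rough Python rendering of the current multiply-and-byteswap step; the multiplier is taken from the linked code, but the Python itself is illustrative, not the actual implementation. It shows why a key of 0 is forced to hash to 0.

```python
MULTIPLIER = 11400714785074694791  # kMultiplier in key_hash_internal.cc

def hash_int32(key: int) -> int:
    # uint64_t product, 64-bit byteswap, then truncation to uint32_t.
    product = (key * MULTIPLIER) & 0xFFFFFFFFFFFFFFFF
    swapped = int.from_bytes(product.to_bytes(8, "little"), "big")  # BYTESWAP
    return swapped & 0xFFFFFFFF

print(hash_int32(0))  # 0: the product is 0, so the hash is 0 and collides with the NULL hash
print(hash_int32(1))  # nonzero
```

One option, discussed below, is to offset the key by a constant before the multiply, which removes the forced collision at the cost of a single extra addition per key.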
Well, you could just add a constant, e.g.:
```cpp
template <bool T_COMBINE_HASHES, typename T>
void Hashing32::HashIntImp(uint32_t num_keys, const T* keys, uint32_t* hashes) {
  constexpr uint64_t kMultiplier = 11400714785074694791ULL;
  // Added constant: offsets the key before the multiply so that a key of 0
  // no longer hashes to 0 (which collides with the fixed NULL hash).
  constexpr uint64_t kAddend = 9756277977048271785ULL;
  for (uint32_t ikey = 0; ikey < num_keys; ++ikey) {
    uint64_t x = static_cast<uint64_t>(keys[ikey]) + kAddend;
    uint32_t hash = static_cast<uint32_t>(BYTESWAP(x * kMultiplier));
    if (T_COMBINE_HASHES) {
      hashes[ikey] = CombineHashesImp(hashes[ikey], hash);
    } else {
      hashes[ikey] = hash;
    }
  }
}
```
Thanks, that was my idea as well; I just didn't want to add an extra operation.
Just asking - any idea if there is line of sight to this issue being closed? I am interested because, among other things, I believe this Arrow functionality is the underlying reason why pandas cannot currently group by a pyarrow-backed struct series.
The PR should be ready now. Do you have a reference to the relevant pandas issue perhaps?
@kszucs I couldn’t find an existing pandas issue so I created one, see https://github.com/pandas-dev/pandas/issues/60594#issuecomment-2557730650
@zanmato1984 @pitrou If you have any time, could you review this? The relevant issue (#17211) got flagged as stale, but I think this is still desirable.
@kszucs Could you rebase on main?
@kou sure, rebased.
@kou let me know if there is something missing. We can add support for the missing types in follow-ups.
Sorry for not noticing the review ping; unfortunately this missed milestone 21.0. I'll review again shortly.