arrow icon indicating copy to clipboard operation
arrow copied to clipboard

GH-17211: [C++] Add `hash32` and `hash64` scalar compute functions

Open kszucs opened this issue 1 year ago • 11 comments

Rationale for this change

Support for calculating elementwise hashes.

The PR adds to scalar functions hash32() and hash64() using the existing internal hashing machinery.

What changes are included in this PR?

Continuation of https://github.com/apache/arrow/pull/39836 with the following changes:

  • Use column oriented hash-combine rather than flattening nested elements
  • Support arbitrary nesting levels with an optimization that only hash child arrays if they are also nested

Are these changes tested?

Partially, working on a proper testing suite.

Are there any user-facing changes?

There is a new compute kernel hash_64 available.

  • GitHub Issue: #17211

kszucs avatar Dec 11 '24 12:12 kszucs

Seems like we generate the same hash for both NULL and 0 which is not ideal.

In [1]: import pyarrow as pa

In [2]: import pyarrow.compute as pc

In [3]: pc.hash_64([None])
Out[3]:
<pyarrow.lib.UInt64Array object at 0x124247be0>
[
  0
]

In [4]: pc.hash_64([0])
Out[4]:
<pyarrow.lib.UInt64Array object at 0x1033027a0>
[
  0
]

kszucs avatar Dec 11 '24 13:12 kszucs

Seems like we generate the same hash for both NULL and 0 which is not ideal.

In [1]: import pyarrow as pa

In [2]: import pyarrow.compute as pc

In [3]: pc.hash_64([None])
Out[3]:
<pyarrow.lib.UInt64Array object at 0x124247be0>
[
  0
]

In [4]: pc.hash_64([0])
Out[4]:
<pyarrow.lib.UInt64Array object at 0x1033027a0>
[
  0
]

@pitrou NULLs are explicitly hashed as 0, but 0::int also hashes into 0 due to https://github.com/apache/arrow/blob/main/cpp/src/arrow/compute/key_hash_internal.cc#L304, any idea how could we overcome this to generate unique hashes for both NULL and 0 without performance regression?

kszucs avatar Dec 13 '24 15:12 kszucs

Well, you could just add a constant, e.g.:

template <bool T_COMBINE_HASHES, typename T>
void Hashing32::HashIntImp(uint32_t num_keys, const T* keys, uint32_t* hashes) {
  constexpr uint64_t kMultiplier = 11400714785074694791ULL;
  constexpr uint64_t kAddend = 9756277977048271785ULL;
  for (uint32_t ikey = 0; ikey < num_keys; ++ikey) {
    uint64_t x = static_cast<uint64_t>(keys[ikey]) + kAddend;
    uint32_t hash = static_cast<uint32_t>(BYTESWAP(x * kMultiplier));

    if (T_COMBINE_HASHES) {
      hashes[ikey] = CombineHashesImp(hashes[ikey], hash);
    } else {
      hashes[ikey] = hash;
    }
  }
}

pitrou avatar Dec 13 '24 16:12 pitrou

Well, you could just add a constant, e.g.:

Thanks, that was my idea as well just didn't want to add an extra operation.

kszucs avatar Dec 16 '24 14:12 kszucs

Just asking - any idea if there is line of sight to this issue being closed? I am interested because, among other things, I believe this Arrow functionality is the underlying reason why pandas cannot currently group by a pyarrow-backed struct series.

a-reich avatar Dec 20 '24 01:12 a-reich

Just asking - any idea if there is line of sight to this issue being closed? I am interested because, among other things, I believe this Arrow functionality is the underlying reason why pandas cannot currently group by a pyarrow-backed struct series.

The PR should be ready now. Do you have a reference to the relevant pandas issue perhaps?

kszucs avatar Dec 20 '24 09:12 kszucs

@kszucs I couldn’t find an existing pandas issue so I created one, see https://github.com/pandas-dev/pandas/issues/60594#issuecomment-2557730650

a-reich avatar Dec 21 '24 03:12 a-reich

@zanmato1984 @pitrou if you have any time, could you review this? the relevant issue (#17211) got flagged as stale, but I think this is still desirable

drin avatar Jun 20 '25 17:06 drin

@kszucs Could you rebase on main?

kou avatar Jun 20 '25 20:06 kou

@kou sure, rebased.

kszucs avatar Jun 20 '25 21:06 kszucs

@kou let me know if there is something missing. We can add support for the missing types in follow-ups.

kszucs avatar Jun 24 '25 13:06 kszucs

Sorry for not noticing the ping for review and this unfortunately missed MS 21.0. I'll review again shortly.

zanmato1984 avatar Jul 17 '25 10:07 zanmato1984