arrow icon indicating copy to clipboard operation
arrow copied to clipboard

GH-17211: [C++] Add hash_64 scalar compute function

Open drin opened this issue 1 year ago • 8 comments

This PR is a fresh continuation of https://github.com/apache/arrow/pull/13487 which was closed purely in favor of this one. The reason for this is that the other PR carried a lot of "development burden" in the form of a long commit history. This fresh version maintains the current state of the code in much fewer commits with a cleaner history. Hopefully this PR will allow others to finish what I started (if I don't finish this first).

The changes still required is to add functional mechanisms for the following:

  • A TempVectorStack for local allocation within the kernel

    • This is necessary to allocate a modified structure of the input data to then be passed to the hashing functions
  • Functional testing of a kernel that can handle a nested structure (e.g. a List[List[utf-8]]).

  • Closes: #17211

drin avatar Jan 29 '24 17:01 drin

@drin are you likely to get some time to push this forward? #32504 is causing us issues and I'd be great to move towards solving it!

judahrand avatar Jun 03 '24 14:06 judahrand

sure! next week I'll have more time and I'll try to squeeze in a bit of time this week.

drin avatar Jun 03 '24 14:06 drin

@judahrand sorry, I ended up taking a break last week, but I happened to remember this today. I will work on this issue this week and next week and my goal will be to have it finished (or close to finished) by end of next week.

I'll try to stage changes so that I can validate on a single nested structure like what is described in #32504 (list<>) and once I have some tests for that working I will try to accommodate arbitrary nesting (list<list<...>>).

drin avatar Jun 19 '24 20:06 drin

rebased on main branch

drin avatar Jun 28 '24 00:06 drin

@judahrand, I wanted to verify something with you since I assume you have an immediate use case in mind for #32504 .

The naive implementation of hashing for a "shallow" nested type (e.g. list<int8>) is relatively straightforward and the code available in this PR should be able to handle that already.

What held up this PR was logic for "deep" nested types (e.g. list<list<int8>>). To move towards addressing this, I am considering the following simple scenario and hoping to get your feedback that it is addressed reasonably (this is why it would be helpful if you have use cases in mind):

# A "shallow" nested type that we should be able to reproduce (I think)
Reference Test data [length: 3]:
[
  [ 1, 1, 2, 3,  4 ],
  [ 2, 2, 4, 6,  8 ],
  [ 3, 3, 6, 9, 12 ]
]

# A "deep" nested type that we may receive as input
Sample Test data (list<item: list<item: int64>>| [length: 3]):
[
  [ [ 1 ], [ 1, 2, 3,  4 ] ],
  [ [ 2 ], [ 2, 4, 6,  8 ] ],
  [ [ 3 ], [ 3, 6, 9, 12 ] ]
]

There is ambiguity in how a "deep" nested type should be hashed (maybe the nested structure should produce a different hash for a "row" of values), but what makes the most sense to me is for it to be value-based (the above example). In the case that a different approach is preferred, other approaches can either be encoded as future options or a UDF provided instead of the existing hashing functions.

I have prototyped a naive version of the flattening logic in a separate repo: recipe_convert.cpp#L150-L179. For context, this type of flattening would be used in the FastHashScalar::Exec function as a preprocessing step before calling the Hashing32::HashMultiColumn function.

Thanks for any feedback you can provide! Particularly on expected behavior for nested array types.

drin avatar Jul 03 '24 22:07 drin

An alternate approach is to return a

Status::NotImplemented("Cannot hash nested data types with many levels of nesting")

if any of the columns to hash are "deep" nested types.

drin avatar Jul 03 '24 22:07 drin

There is ambiguity in how a "deep" nested type should be hashed (maybe the nested structure should produce a different hash for a "row" of values), but what makes the most sense to me is for it to be value-based (the above example).

I'm not entirely sure I've understood your distinction between the two cases. Am I correct in understanding that in the first of the two options that you describe a row containing[[1], [1,2,3,4]] would have a different hash value to [[1,1], [2,3,4]] but in the second, 'value-based' option these two lists would have the same hash?

If that is the case then the 'value-based' behaviour would be surprising to me. I'd have thought that two rows with what is definitely different data should have different hashes (not considering the small chance of a hash collision).

For the case of nested listed I wonder if this issue could be avoided by including the offsets of the parent list in the hash?

judahrand avatar Jul 04 '24 08:07 judahrand

I'm not entirely sure I've understood your distinction between the two cases. Am I correct in understanding that in the first of the two options that you describe a row containing[[1], [1,2,3,4]] would have a different hash value to [[1,1], [2,3,4]] but in the second, 'value-based' option these two lists would have the same hash?

Oh yeah, I guess I didn't make that part clear. But yes, you understood correctly. I was in favor of [[1], [1, 2, 3, 4]] having the same hash as [[1, 1], [2, 3, 4]] (what I called the "value-based" option).

If you want those to have different hashes, then it may be possible to transform the list array into a struct array and one struct column can be all of the offsets coalesced and a second struct column can be all of the values (referenced). To do it any other way would require extending the existing hashing functions in non-trivial ways or adding new hashing functions entirely.

So, maybe for this PR I can simply do shallow nested arrays and someone can work on the deep nested arrays in aa follow-up PR.

drin avatar Jul 05 '24 05:07 drin

Closing this PR in favor of #45001 .

drin avatar Feb 26 '25 18:02 drin