iceberg icon indicating copy to clipboard operation
iceberg copied to clipboard

Clarify hash specification for date and boolean

Open findepi opened this issue 3 years ago • 2 comments

It was using hashInt as a way of saying "hash like int type". However, this turned out to be confused with "invoke hashInt on a hasher", which yields a different result.

Make it explicit how the hash should be calculated by avoiding delegation and mentioning hashLong explicitly.

findepi avatar Sep 16 '22 07:09 findepi

The float hash also uses this delegation pattern

float hashDouble(double(v)) [4] 1.0F → -142385009

however, it's not being confused by readers, because there is no other "hashDouble" operation available.

findepi avatar Sep 16 '22 07:09 findepi

Addresses confusion as discussed here https://apache-iceberg.slack.com/archives/C03LG1D563F/p1662993090595709 and here https://apache-iceberg.slack.com/archives/C03LG1D563F/p1663206871379419

findepi avatar Sep 19 '22 11:09 findepi

The intent here was to refer to the function defined for the int type. It looks like people are confusing these functions with hash specific methods in the hash implementation. I think it's fine to clarify, but this change makes the spec less specific because it removes the long(value) cast that is required for int and makes it implicit.

Can we rephrase the change here so that it is simply clear that the hash for date depends on the hash for int?

rdblue avatar Sep 29 '22 21:09 rdblue

The intent here was to refer to the function defined for the int type.

I know. This PR isn't fixing anything, as nothing was wrong. It only tries to avoid potentially confusing wording.

t looks like people are confusing these functions with hash specific methods in the hash implementation.

Indeed.

this change makes the spec less specific because it removes the long(value) cast that is required for int and makes it implicit.

Not sure where you see this. Is it in hashLong(daysFromUnixEpoch(v))? The daysFromUnixEpoch it a hypothetic function, not defined anywhere, so a reader can just assume it returns a long number, if that's what hashLong expects.

findepi avatar Sep 30 '22 09:09 findepi

daysFromUnixEpoch is represented in most places as an int and I didn't want any confusion over how it should be represented. An int is perfectly fine. Since hashInt is also a hypothetical function, it's okay to use it.

I think the fix is to name the functions in the spec as they are defined to make it clear that we're referring back to them.

rdblue avatar Sep 30 '22 16:09 rdblue

Since hashInt is also a hypothetical function, it's okay to use it.

it's also a name of org.apache.iceberg.relocated.com.google.common.hash.HashFunction#hashInt available on the Hashing.murmur3_32_fixed object.

daysFromUnixEpoch is represented in most places as an int

iceberg master$ git grep daysFromUnixEpoch
format/spec.md:| **`date`**         | `hashInt(daysFromUnixEpoch(v))`           | `2017-11-16` → `-653330422`                |

I think the fix is to name the functions in the spec as they are defined to make it clear that we're referring back to them.

i am not sure i am following. can you please exemplify?

findepi avatar Sep 30 '22 18:09 findepi

@rdblue i am still looking for more guidance from you.

findepi avatar Oct 06 '22 10:10 findepi

@rdblue @nastra @bitsondatadev should i rebase or close?

findepi avatar Jul 03 '23 13:07 findepi

@findepi just saw this, let me follow up with @rdblue after the holiday!

bitsondatadev avatar Jul 04 '23 06:07 bitsondatadev

@findepi

@rdblue says there's still some ambiguity. Rather than using hashInt that refers to the hashLong(long(i)) you changed it to just hashLong(...). But the cast is important.

bitsondatadev avatar Jul 11 '23 00:07 bitsondatadev

@bitsondatadev you mean it should be hashLong(long(daysFromUnixEpoch(v))) instead of hashLong(daysFromUnixEpoch(v))? I don't see how the latter is ambiguous, but sure I can do that. I want confirmation though that I understood the right way

findepi avatar Jul 12 '23 15:07 findepi