Clarify hash specification for date and boolean
It was using hashInt as a way of saying "hash like int type". However, this turned out to be confused with "invoke hashInt on a hasher", which yields a different result.
Make it explicit how the hash should be calculated by avoiding delegation and mentioning hashLong explicitly.
The float hash also uses this delegation pattern
float hashDouble(double(v)) [4] 1.0F → -142385009
however, it's not being confused by readers, because there is no other "hashDouble" operation available.
Addresses confusion as discussed here https://apache-iceberg.slack.com/archives/C03LG1D563F/p1662993090595709 and here https://apache-iceberg.slack.com/archives/C03LG1D563F/p1663206871379419
The intent here was to refer to the function defined for the int type. It looks like people are confusing these functions with hash specific methods in the hash implementation. I think it's fine to clarify, but this change makes the spec less specific because it removes the long(value) cast that is required for int and makes it implicit.
Can we rephrase the change here so that it is simply clear that the hash for date depends on the hash for int?
The intent here was to refer to the function defined for the
inttype.
I know. This PR isn't fixing anything, as nothing was wrong. It only tries to avoid potentially confusing wording.
t looks like people are confusing these functions with hash specific methods in the hash implementation.
Indeed.
this change makes the spec less specific because it removes the long(value) cast that is required for int and makes it implicit.
Not sure where you see this.
Is it in hashLong(daysFromUnixEpoch(v))?
The daysFromUnixEpoch it a hypothetic function, not defined anywhere, so a reader can just assume it returns a long number, if that's what hashLong expects.
daysFromUnixEpoch is represented in most places as an int and I didn't want any confusion over how it should be represented. An int is perfectly fine. Since hashInt is also a hypothetical function, it's okay to use it.
I think the fix is to name the functions in the spec as they are defined to make it clear that we're referring back to them.
Since
hashIntis also a hypothetical function, it's okay to use it.
it's also a name of org.apache.iceberg.relocated.com.google.common.hash.HashFunction#hashInt available on the Hashing.murmur3_32_fixed object.
daysFromUnixEpoch is represented in most places as an int
iceberg master$ git grep daysFromUnixEpoch
format/spec.md:| **`date`** | `hashInt(daysFromUnixEpoch(v))` | `2017-11-16` → `-653330422` |
I think the fix is to name the functions in the spec as they are defined to make it clear that we're referring back to them.
i am not sure i am following. can you please exemplify?
@rdblue i am still looking for more guidance from you.
@rdblue @nastra @bitsondatadev should i rebase or close?
@findepi just saw this, let me follow up with @rdblue after the holiday!
@findepi
@rdblue says there's still some ambiguity. Rather than using hashInt that refers to the hashLong(long(i)) you changed it to just hashLong(...). But the cast is important.
@bitsondatadev you mean it should be hashLong(long(daysFromUnixEpoch(v))) instead of hashLong(daysFromUnixEpoch(v))? I don't see how the latter is ambiguous, but sure I can do that. I want confirmation though that I understood the right way