
[Proposal] Improve the hash module

Open mzaks opened this issue 2 months ago • 6 comments

This proposal is based on discussion started in https://github.com/modularml/mojo/issues/1744

mzaks avatar Apr 09 '24 15:04 mzaks

Nice proposal 👍🏻 Please note there appears to be a typo in the filename: imporved-hash-module.md should be improved-hash-module.md.

theopomies avatar Apr 10 '24 09:04 theopomies

Thanks so much for going through the proposal process! We as a team really appreciate the thought and motivation presented. It makes things easier when discussing the design of the proposal. 🎉 🎉 🎉

Decision

  • Accepted, pending responses to @gabrieldemarmiesse's comments and feedback on the PR

Some open questions to consider

  • Should we create an alias for the return type of finish(...) (e.g. alias HASH_RETURN_T = UInt64) rather than hard-coding to UInt64? This gives us flexibility to change the return type in the future or have different hashers return different sized things if they care about space savings in certain cases.
  • We prefer the hash_with approach defaulting to the DefaultHasher rather than the hasher_factory approach. If you share more motivation/context on why the function approach may be useful, we may be better equipped to chime in.
  • When implementors (and users) are writing user-defined types, it would be nice if init, update, and finish did not have to show up in their code: i.e. they only worry about hash or hash_with, for example, rather than the full API set required by the trait.
  • In the future (keep things simple for now as you have it!), we may want to consider fancy things like write_length_prefix from Rust. Have you given any thought to this sort of thing?

JoeLoser avatar Apr 18 '24 15:04 JoeLoser

I created a small POC to address all the outstanding questions: https://gist.github.com/mzaks/aa66c831dc5e177c2322d5088aac76aa

Should we create an alias for the return type of finish(...) (e.g. alias HASH_RETURN_T = UInt64) rather than hard-coding to UInt64? This gives us flexibility to change the return type in the future or have different hashers return different sized things if they care about space savings in certain cases.

I suggest parametrizing the return type of finish with a default; see the sketch below. It might be sensible to put the hash value type parameter directly on the Hasher trait, which would allow Hasher to have different implementations for different DTypes rather than just a bitcast at the end.

trait Hasher[hash_value_type: DType]:
    fn __init__(inout self):
        # Construct a hasher in its initial state.
        ...
    fn update(inout self, bytes: DTypePointer[DType.uint8], n: Int):
        # Absorb n raw bytes into the running state.
        ...
    fn finish(owned self) -> Scalar[hash_value_type]:
        # Consume the hasher and produce the final hash value.
        ...

I did not do this at this point mainly because the compiler does not allow default parameters on traits (trait Hasher[hash_value_type: DType = DType.uint64] does not work).
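For illustration, a concrete hasher following this three-method shape could look roughly like the sketch below. The name FNV1a64Hasher and the FNV-1a constants are purely illustrative and not part of the proposal; the conformance annotation is left as a comment until the parametrized trait is actually available.

struct FNV1a64Hasher:
    # Would conform to Hasher[DType.uint64] from the sketch above.
    var value: UInt64

    fn __init__(inout self):
        # FNV-1a 64-bit offset basis.
        self.value = 0xCBF29CE484222325

    fn update(inout self, bytes: DTypePointer[DType.uint8], n: Int):
        # Absorb n raw bytes, one FNV-1a step (xor then multiply) per byte.
        for i in range(n):
            self.value = (self.value ^ bytes[i].cast[DType.uint64]()) * 0x100000001B3

    fn finish(owned self) -> Scalar[DType.uint64]:
        # Consume the hasher and return the accumulated hash value.
        return self.value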

We prefer the hash_with approach defaulting to the DefaultHasher rather than the hasher_factory approach. If you share more motivation/context on why the function approach may be useful, we may be better equipped to chime in.

I prefer it too now. I thought of the factory approach because some Hashers might have complex initialization needs; for example, they often need a random seed and a secret. But as you can see here, I was able to come up with a strategy where users can influence the hasher behavior with compile-time parameters and environment variables.
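As a minimal sketch of that strategy (the HASHER_SEED define and the SeededHasher struct are hypothetical, and I am assuming sys.param_env.env_get_int behaves as documented):

from sys.param_env import env_get_int

# Hypothetical: read the seed from a `-D HASHER_SEED=...` build define,
# falling back to 0 when it is not set.
alias HASHER_SEED = env_get_int["HASHER_SEED", 0]()

struct SeededHasher:
    var state: UInt64

    fn __init__(inout self):
        # The seed is baked into the binary as a compile-time parameter.
        self.state = UInt64(HASHER_SEED)
    # update(...) and finish() elided; see the Hasher sketch above.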

When implementors (and users) are writing user-defined-types, it would be nice for them to not have to always worry about init, update, and finish to show up in their code: i.e. they only worry about hash or hash_with for example rather than the full API set for the trait requirements.

Yes, as you can see here, it is trivial to implement the hash method once the standard library implements hash methods for the basic types. As I mentioned here, it would also be very easy to synthesize the method. The only concern I have right now is what happens with references, and whether we want to provide a solution for circular references. We could say that structs with references do a shallow hash, where they only hash the reference address. Or we could allow a deep hash, where we would need to be able to check whether the referenced value was already visited by the Hasher. This would be possible if we add fn visited(inout self, id: ???) -> Bool to the Hasher. As you can see, I am not certain what the type of the reference id should be; maybe Scalar[DType.address].
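To make the "users only touch hash_with" point concrete, a user-defined type could end up looking roughly like this (Person is a made-up example, and the exact hash_with signature is an assumption based on the proposal):

@value
struct Person:
    # Would conform to a Hashable-style trait requiring hash_with.
    var name: String
    var age: Int

    fn hash_with[H: Hasher](self, inout hasher: H):
        # Delegate to the field types; no __init__/update/finish
        # boilerplate shows up in user code.
        self.name.hash_with(hasher)
        self.age.hash_with(hasher)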

In the future (keep things simple for now as you have it!), we may want to consider fancy things like write_length_prefix from Rust. Have you given any thought to this sort of thing?

To be honest, I did not consider it until now. After examining the method docs, I think it's a bit redundant. The collection implementer can add the length of the collection as an Int to the hasher in order to solve the issue of [1, 2, 3] and [[1], [2, 3]] generating the same hash value. I think a special method does not help much.
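For example, a collection's hash_with could feed the length in first (IntList is a made-up wrapper just to show the idea, and again the hash_with signature is assumed):

struct IntList:
    var data: List[Int]

    fn hash_with[H: Hasher](self, inout hasher: H):
        # Mixing the length in first keeps [1, 2, 3] and [[1], [2, 3]]
        # from feeding identical byte streams to the hasher.
        len(self.data).hash_with(hasher)
        for i in range(len(self.data)):
            self.data[i].hash_with(hasher)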

mzaks avatar Apr 26 '24 05:04 mzaks

I had a call with @gabrieldemarmiesse yesterday where we discussed the proposal in depth. As a result, I committed some improvements to the proposal today.

mzaks avatar Apr 28 '24 12:04 mzaks

Looks good! We can improve on the API and Python compatibility when the compiler has more flexibility :)

gabrieldemarmiesse avatar Apr 28 '24 17:04 gabrieldemarmiesse