[Proposal] Improve the hash module
This proposal is based on discussion started in https://github.com/modularml/mojo/issues/1744
Nice proposal 👍🏻
Please note there appears to be a typo in the filename: `imporved-hash-module.md` when it should actually be `improved-hash-module.md`.
Thanks so much for going through the proposal process! We as a team really appreciate the thought and motivation presented. It makes things easier when discussing the design of the proposal. 🎉 🎉 🎉
Decision
- Accepted, pending responses to @gabrieldemarmiesse's comments and feedback on the PR
Some open questions to consider
- Should we create an alias for the return type of `finish(...)` (e.g. `alias HASH_RETURN_T = UInt64`) rather than hard-coding to `UInt64`? This gives us flexibility to change the return type in the future, or to have different hashers return different sized things if they care about space savings in certain cases.
- We prefer the `hash_with` approach defaulting to the `DefaultHasher` rather than the `hasher_factory` approach. If you share more motivation/context on why the function approach may be useful, we may be better equipped to chime in.
- When implementors (and users) are writing user-defined types, it would be nice for them to not have to always worry about `__init__`, `update`, and `finish` showing up in their code: i.e. they only worry about `hash` or `hash_with`, for example, rather than the full API set for the trait requirements.
- In the future (keep things simple for now as you have it!), we may want to consider fancy things like `write_length_prefix` from Rust. Have you given any thought to this sort of thing?
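To make the `hash_with` question concrete, here is a minimal sketch in Python (used here only so the idea is easy to run; the names `DefaultHasher`, `update`, `finish`, and `hash_with` mirror the proposal, while the toy FNV-1a implementation and everything else are illustrative assumptions, not the stdlib's actual hasher):

```python
# Illustrative sketch of the proposed hash_with shape: callers pick a
# hasher *type* (defaulting to DefaultHasher), and the function drives
# the init -> update -> finish lifecycle for them.

class DefaultHasher:
    """Toy 64-bit FNV-1a hasher following the init/update/finish shape."""
    def __init__(self):
        self._state = 0xCBF29CE484222325  # FNV-1a offset basis

    def update(self, data: bytes) -> None:
        for byte in data:
            self._state ^= byte
            self._state = (self._state * 0x100000001B3) % (1 << 64)

    def finish(self) -> int:
        return self._state

def hash_with(value: bytes, hasher_type=DefaultHasher) -> int:
    hasher = hasher_type()
    hasher.update(value)
    return hasher.finish()

print(hash_with(b"mojo") == hash_with(b"mojo"))  # deterministic: True
```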
I created a small POC to address all the outstanding questions: https://gist.github.com/mzaks/aa66c831dc5e177c2322d5088aac76aa
> Should we create an alias for the return type of `finish(...)` (e.g. `alias HASH_RETURN_T = UInt64`) rather than hard-coding to `UInt64`? This gives us flexibility to change the return type in the future or have different hashers return different sized things if they care about space savings in certain cases.
I suggest parametrizing the return type of `finish` with a default, see the POC linked above. It might be sensible to put the hash value type parameter directly on the `Hasher` trait, which would allow `Hasher` to have different implementations for different DTypes, not just a bitcast at the end.
```mojo
trait Hasher[hash_value_type: DType]:
    fn __init__(inout self):
        ...
    fn update(inout self, bytes: DTypePointer[DType.uint8], n: Int):
        ...
    fn finish(owned self) -> Scalar[hash_value_type]:
        ...
```
I did not do it at this point in time mainly because the compiler does not allow default parameters on traits (`trait Hasher[hash_value_type: DType = DType.uint64]` does not work).
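One way to read the "different sized results" idea is an output width chosen per hasher instance. A Python sketch (the XOR-folding scheme and the `out_bits` parameter are purely illustrative assumptions, standing in for a `DType` parameter on the trait):

```python
# Hypothetical sketch: a hasher parametrized over its output width, so
# implementations can return smaller hash values where space matters,
# rather than always bitcasting a UInt64 at the end.

class FoldingHasher:
    def __init__(self, out_bits: int = 64):
        assert out_bits in (32, 64)
        self.out_bits = out_bits
        self._state = 0xCBF29CE484222325  # FNV-1a offset basis

    def update(self, data: bytes) -> None:
        for byte in data:
            self._state ^= byte
            self._state = (self._state * 0x100000001B3) % (1 << 64)

    def finish(self) -> int:
        if self.out_bits == 64:
            return self._state
        # XOR-fold the 64-bit state down to 32 bits: every state bit
        # still influences the result, unlike a plain truncation.
        return (self._state >> 32) ^ (self._state & 0xFFFFFFFF)

h32 = FoldingHasher(out_bits=32)
h32.update(b"mojo")
print(h32.finish() < (1 << 32))  # result fits in 32 bits: True
```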
> We prefer the `hash_with` approach defaulting to the `DefaultHasher` rather than the `hasher_factory` approach. If you share more motivation/context on why the function approach may be useful, we may be better equipped to chime in.
I prefer it too now. I thought of the factory approach because some Hashers might have complex initialization needs; e.g. they often need a random seed and a secret. But as you can see here, I was able to come up with a strategy where users can influence the hasher behavior with compile-time parameters and environment variables.
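The strategy can be approximated in Python as follows. The environment variable name `MOJO_HASH_SEED` and the random fallback are assumptions made for this sketch, not the POC's actual mechanism:

```python
import os
import random

# Hypothetical sketch: users influence the hasher's seed through an
# environment variable (for reproducible runs), with a random per-process
# fallback, as DoS-resistant hash implementations typically use.

def resolve_seed() -> int:
    env = os.environ.get("MOJO_HASH_SEED")  # variable name is made up here
    if env is not None:
        return int(env)            # reproducible, e.g. for debugging/tests
    return random.getrandbits(64)  # randomized otherwise

class SeededHasher:
    def __init__(self, seed=None):
        self._state = resolve_seed() if seed is None else seed

    def update(self, data: bytes) -> None:
        for byte in data:
            self._state = ((self._state ^ byte) * 0x100000001B3) % (1 << 64)

    def finish(self) -> int:
        return self._state

os.environ["MOJO_HASH_SEED"] = "42"
a, b = SeededHasher(), SeededHasher()
a.update(b"key")
b.update(b"key")
print(a.finish() == b.finish())  # same seed, same input: True
```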
> When implementors (and users) are writing user-defined types, it would be nice for them to not have to always worry about `__init__`, `update`, and `finish` showing up in their code: i.e. they only worry about `hash` or `hash_with` for example rather than the full API set for the trait requirements.
Yes, as you can see here, it is trivial to implement the hash method when the standard library implements hash methods for the basic types. As I mentioned here, it would be very easy to synthesize the method as well. The only concern I have right now is what happens with references, and whether we want to provide a solution for circular references. We could say that structs with references do a shallow hash, where they only hash the reference address. Or we allow a deep hash, where we would need to be able to check if the referenced value was already visited by the Hasher. This would be possible if we add `fn visited(inout self, id: ???) -> Bool` to the Hasher. As you can see, I am not certain what the type of the reference `id` should be, maybe `Scalar[DType.address]`.
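A deep hash with a `visited` check along these lines can be sketched in Python, using `id()` as a stand-in for the address-like reference id (the node representation and helper names are assumptions for illustration):

```python
# Illustrative sketch of the proposed `visited` check: the hasher records
# reference ids it has already hashed, so cyclic structures terminate.

class CycleAwareHasher:
    def __init__(self):
        self._state = 0xCBF29CE484222325
        self._seen = set()

    def visited(self, ref_id: int) -> bool:
        """Return True if this reference was hashed before; record it otherwise."""
        if ref_id in self._seen:
            return True
        self._seen.add(ref_id)
        return False

    def update(self, data: bytes) -> None:
        for byte in data:
            self._state = ((self._state ^ byte) * 0x100000001B3) % (1 << 64)

    def finish(self) -> int:
        return self._state

def hash_node(hasher, node):
    # On a revisit, hash only the reference id, not the value: this is
    # what makes a cyclic linked structure hashable without recursing forever.
    if hasher.visited(id(node)):
        hasher.update(id(node).to_bytes(8, "little"))
        return
    hasher.update(node["value"].to_bytes(8, "little"))
    if node["next"] is not None:
        hash_node(hasher, node["next"])

a = {"value": 1, "next": None}
b = {"value": 2, "next": a}
a["next"] = b  # create a cycle: a -> b -> a
h = CycleAwareHasher()
hash_node(h, a)  # terminates despite the cycle
print(isinstance(h.finish(), int))  # True
```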
> In the future (keep things simple for now as you have it!), we may want to consider fancy things like `write_length_prefix` from Rust. Have you given any thought to this sort of thing?
To be honest, I did not consider it until now. After examining the method docs, I think it's a bit redundant. The collection implementer can add the length of the collection as an `Int` to the hasher, in order to solve the issue of `[1, 2, 3]` and `[[1], [2, 3]]` generating the same hash value. I think a special method does not help much.
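The collision is easy to reproduce, and feeding each collection's length to the hasher removes it. A Python sketch (the byte encoding and helper names are illustrative assumptions):

```python
# Why hashing a collection's length matters: without it, the flattened
# element streams of [1, 2, 3] and [[1], [2, 3]] are byte-identical.

def fnv1a(data: bytes) -> int:
    state = 0xCBF29CE484222325
    for byte in data:
        state = ((state ^ byte) * 0x100000001B3) % (1 << 64)
    return state

def encode(value, out: bytearray, with_length: bool) -> None:
    if isinstance(value, list):
        if with_length:
            # Feed the collection length, as an Int, before its elements.
            out += len(value).to_bytes(8, "little")
        for item in value:
            encode(item, out, with_length)
    else:
        out += value.to_bytes(8, "little")

def hash_value(value, with_length: bool) -> int:
    buf = bytearray()
    encode(value, buf, with_length)
    return fnv1a(bytes(buf))

flat, nested = [1, 2, 3], [[1], [2, 3]]
print(hash_value(flat, False) == hash_value(nested, False))  # collide: True
print(hash_value(flat, True) == hash_value(nested, True))    # distinct: False
```

With the length mixed in, the two inputs produce different byte streams, so no dedicated `write_length_prefix` method is required for correctness.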
I had a call with @gabrieldemarmiesse yesterday where we discussed the proposal in depth. As a result, I committed some improvements to the proposal today.
Looks good! We can improve on the API and Python compatibility when the compiler has more flexibility :)