attrs
Allow custom hashing function in `_make_hash`
Hi again!
As we know, hash(x) in Python yields different values across different interpreter sessions for security reasons. I am using attrs in all of the class definitions in a library and would like to bootstrap off of the programmatically generated hash function to essentially create a Merkle tree. I would like to use hashlib.md5() or something similar to create a digest from all the attributes attrs can find in the class so that:
- the hash is consistent across runs.
- less importantly for me, but maybe more importantly for others, using a better hash allows for fewer collisions.
As always, please let me know if this is something attrs can/cannot support, and I will work with/around it accordingly. Thanks!
P.S. My workaround would be to define my own function that finds all attributes using attr.asdict() and then creates a hash from them. Not terrible, but I figured it'd be worth asking for a "purer" solution.
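For reference, the workaround described above could look roughly like the sketch below. The class Point and the helper stable_digest are hypothetical names made up for illustration, SHA-256 is swapped in for md5, and the sketch assumes every field value is JSON-serializable.

```python
import hashlib
import json

import attr


@attr.s(frozen=True)
class Point:  # hypothetical example class
    x = attr.ib()
    y = attr.ib()


def stable_digest(instance):
    """Return a hex digest that depends only on the class name and field values."""
    payload = {
        "class": type(instance).__qualname__,
        # attr.asdict recurses into nested attrs instances by default.
        "fields": attr.asdict(instance),
    }
    # sort_keys gives a canonical key order, so the bytes fed to the hash
    # do not depend on dict ordering or on the interpreter session.
    encoded = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


digest = stable_digest(Point(1, 2))
```

Unlike hash(), the result is a string rather than an int, so it cannot be returned from __hash__ directly; it is meant as a separate identifier.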
You can make the hash consistent by using https://docs.python.org/3.3/using/cmdline.html#envvar-PYTHONHASHSEED, but I wouldn't recommend it (why do you want it to be consistent?).
Also my gut feeling is using md5() is going to be much, much slower than the built-in hash, enough so that any reduced hash collisions will be completely lost in the overhead.
> would like to bootstrap off of the programmatically generated hash function to essentially create a merkle tree
I'm aware of PYTHONHASHSEED. Using it together with hash has a host of problems, including the inability to set it from within a running Python interpreter session, no guarantees of portability, etc.
I want it to be consistent because I'm running a distributed workload which requires spinning up multiple python processes remotely, and I would like to make sure that they all hash to the same result so I can check for the existence of an attrs instance in a distributed manner.
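A small experiment illustrating the concern, as a sketch only: two fresh interpreters started with different PYTHONHASHSEED values stand in for two remote workers. The helper run_in_fresh_interpreter is a hypothetical name, and SHA-256 stands in for whichever hashlib function would actually be used.

```python
import hashlib
import os
import subprocess
import sys


def run_in_fresh_interpreter(code, seed):
    """Run `code` in a new Python process with the given PYTHONHASHSEED."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


builtin = "print(hash('attrs'))"
digest = "import hashlib; print(hashlib.sha256(b'attrs').hexdigest())"

# Different seeds stand in for two independently started workers.
builtin_a = run_in_fresh_interpreter(builtin, seed=1)
builtin_b = run_in_fresh_interpreter(builtin, seed=2)
digest_a = run_in_fresh_interpreter(digest, seed=1)
digest_b = run_in_fresh_interpreter(digest, seed=2)
```

Here builtin_a and builtin_b are expected to differ because the string hash is seeded per process, while digest_a and digest_b match because hashlib depends only on the input bytes.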
I wanna say that for your use case it's cleaner to have a separate concept of hashes than to re-use the one from Python.
That said, we offer customization for other methods too, and I think it wouldn't be too complex to offer a way to replace the call to hash in hash((cls, attr1, attr2, …)) with something custom. Therefore I would merge a good PR to add this.
tbh, i think using hash() for anything other than what python is using it for (dictionaries, sets, etc.) is a terrible idea, and attrs shouldn't facilitate hooking into what i consider internal puzzle pieces used for hash generation, beyond what's currently possible:
- whether to make a class hashable at all (generated __hash__())
- if so, which attributes to consider (attr.field(hash=...) arg)
what problem are we trying to solve here? the original description mentions ‘merkle tree’, but these are typically used in security sensitive domains. i'd think a cryptographic hash function over a stable serialization, which cattrs + json.dumps(sort_keys=True) can provide for example, is a much better approach here.
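As a rough sketch of that idea, the function below computes a Merkle-style digest over data that has already been unstructured into plain dicts, lists and scalars (for example by cattrs; that step is not shown). The name merkle_digest is hypothetical, SHA-256 is an arbitrary choice of cryptographic hash, and dict keys are assumed to be strings.

```python
import hashlib
import json


def merkle_digest(node):
    """Hash nested dicts/lists bottom-up so each parent digest covers its children's digests."""
    if isinstance(node, dict):
        # Sort by key so the digest does not depend on insertion order.
        children = [[key, merkle_digest(value)] for key, value in sorted(node.items())]
        payload = ["dict", children]
    elif isinstance(node, (list, tuple)):
        # Sequence order is significant, so it is preserved.
        payload = ["list", [merkle_digest(item) for item in node]]
    else:
        # Leaves are assumed to be JSON-serializable scalars (str, int, float, bool, None).
        payload = ["leaf", node]
    encoded = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


a = {"x": 1, "child": {"y": [1, 2]}}
b = {"child": {"y": [1, 2]}, "x": 1}  # same content, different key order
same = merkle_digest(a) == merkle_digest(b)
```

Because each subtree has its own digest, two workers can compare the digest of a nested value without exchanging the value itself.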