
Allow custom hashing function in `_make_hash`

Open OneRaynyDay opened this issue 2 years ago • 4 comments

Hi again!

As we know, `hash(x)` in Python can yield different values across interpreter sessions, because string hashes are randomized for security. I use attrs for all of the class definitions in a library, and I would like to build on the programmatically generated hash function to essentially create a Merkle tree. I would like to use `hashlib.md5()` or something similar to create a digest from all the attributes attrs can find in the class, so that:

  1. the hash is consistent across runs.
  2. less importantly for me, but maybe more importantly for others, a better hash function leads to fewer collisions.
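The per-session behaviour described above can be observed with a small experiment. This is a minimal sketch, assuming CPython with its default hash randomization; the helper name `run_in_fresh_interpreter` is made up for illustration.

```python
import hashlib
import os
import subprocess
import sys


def run_in_fresh_interpreter(code: str) -> str:
    """Run `code` in a new Python process and return its stripped stdout."""
    env = dict(os.environ)
    # Drop any inherited seed so each child picks its own random one.
    env.pop("PYTHONHASHSEED", None)
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


BUILTIN = "print(hash('attrs'))"
MD5 = "import hashlib; print(hashlib.md5(b'attrs').hexdigest())"

builtin_runs = [run_in_fresh_interpreter(BUILTIN) for _ in range(2)]
md5_runs = [run_in_fresh_interpreter(MD5) for _ in range(2)]

print("hash() per session:", builtin_runs)  # expected to differ between sessions
print("md5 per session:   ", md5_runs)      # the same hex digest both times
```

The two `hash()` values are expected to differ because each child interpreter draws its own seed, while the MD5 digest depends only on the input bytes.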

As always, please let me know if this is something attrs can/cannot support, and I will work with/around it accordingly. Thanks!

P.S. My workaround would be to define my own function that finds all attributes using `attr.asdict()` and then creates a hash from them. Not terrible, but I figured it'd be worth asking for a "purer" solution.
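A rough sketch of what such a workaround might look like, written against plain nested dicts so that it runs without attrs installed. The function name `merkle_digest` and the sample data are assumptions for illustration; in real use the input would be the result of `attr.asdict(instance)`, which turns nested attrs instances into nested dicts.

```python
import hashlib


def merkle_digest(value) -> str:
    """Return a hex digest for nested dicts, lists, and simple scalars.

    Containers are hashed from the digests of their children, so equal
    sub-trees always contribute the same digest to their parent.
    """
    h = hashlib.md5()
    if isinstance(value, dict):
        h.update(b"dict")
        for key in sorted(value):  # sort keys so insertion order does not matter
            h.update(merkle_digest(key).encode())
            h.update(merkle_digest(value[key]).encode())
    elif isinstance(value, (list, tuple)):
        h.update(b"list")
        for item in value:
            h.update(merkle_digest(item).encode())
    else:
        # Scalars: include the type name so 1 and "1" do not collide.
        # Only safe for values whose repr() is stable (str, int, float, ...).
        h.update(f"{type(value).__name__}:{value!r}".encode())
    return h.hexdigest()


# Stand-in for attr.asdict(instance) on a class with two nested instances.
point = {"x": 1, "y": 2}
segment = {"start": point, "end": {"x": 3, "y": 4}}
print(merkle_digest(segment))
```

Because every child digest has a fixed length, concatenating them is unambiguous. MD5 is used here only because the post mentions it; any `hashlib` algorithm could be swapped in.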

OneRaynyDay • Mar 04 '22 05:03

You can make the hash consistent by setting the `PYTHONHASHSEED` environment variable (https://docs.python.org/3.3/using/cmdline.html#envvar-PYTHONHASHSEED), but I wouldn't recommend it. Why do you want it to be consistent?

Also, my gut feeling is that `md5()` is going to be much, much slower than the built-in `hash()`, enough that any gain from fewer collisions would be completely lost in the overhead.
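One way to check that gut feeling is a quick micro-benchmark. The sketch below is only an illustration: the sample tuple is made up, the MD5 variant includes the cost of serializing via `repr()`, and the numbers will vary by machine and Python version.

```python
import hashlib
import timeit

# Made-up stand-in for the tuple of values a generated __hash__ would see.
values = ("SomeClass", 1, "two", 3.0)


def builtin_hash():
    return hash(values)


def md5_hash():
    # Serialize first, since hashlib only accepts bytes.
    return hashlib.md5(repr(values).encode()).hexdigest()


n = 100_000
t_builtin = timeit.timeit(builtin_hash, number=n)
t_md5 = timeit.timeit(md5_hash, number=n)

print(f"hash(): {t_builtin / n * 1e9:.0f} ns per call")
print(f"md5():  {t_md5 / n * 1e9:.0f} ns per call")
```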

Tinche • Mar 04 '22 11:03

> would like to bootstrap off of the programmatically generated hash function to essentially create a merkle tree

I'm aware of `PYTHONHASHSEED`. Combining it with `hash()` has a host of problems, including the inability to set it from within a running Python interpreter session, no guarantees of portability, etc.
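The first of those problems can be shown in a few lines. This is a minimal sketch, assuming CPython: the seed is read when the interpreter starts, so changing the environment variable afterwards only affects child processes started later.

```python
import os
import subprocess
import sys

before = hash("attrs")
os.environ["PYTHONHASHSEED"] = "0"  # too late: the seed was read at startup
after = hash("attrs")
print(before == after)  # True: the running interpreter keeps its original seed

# Child processes started afterwards inherit the variable and do honour it.
child = [sys.executable, "-c", "print(hash('attrs'))"]
runs = [
    subprocess.run(child, capture_output=True, text=True, check=True).stdout
    for _ in range(2)
]
print(runs[0] == runs[1])  # True: both children ran with PYTHONHASHSEED=0
```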

I want it to be consistent because I'm running a distributed workload that spins up multiple Python processes remotely, and I would like to make sure that they all hash to the same result so that I can check for the existence of an attrs instance in a distributed manner.

OneRaynyDay • Mar 04 '22 15:03

I'd say that for your use case it's cleaner to have a separate concept of hashes than to reuse the one from Python.

That said, we offer customization for other methods too, and I think it wouldn't be too complex to offer a way to replace the `hash` call in `hash((cls, attr1, attr2, …))` with something custom. Therefore I would merge a good PR to add this.
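To make the idea concrete, here is a hypothetical sketch of a generated `__hash__` with a pluggable hash callable. Nothing here is attrs code or an existing attrs option: `make_hash_method`, `md5_of_tuple`, and the `hash_func` parameter are invented for illustration, and the class name stands in for `cls` so that the input is easy to serialize.

```python
import hashlib


def md5_of_tuple(values: tuple) -> int:
    """Map a tuple of simple values to an int via the MD5 of its repr()."""
    digest = hashlib.md5(repr(values).encode()).digest()
    # __hash__ must return an int; 7 bytes fits in a 64-bit signed integer.
    return int.from_bytes(digest[:7], "big")


def make_hash_method(field_names, hash_func=hash):
    """Build a __hash__ that applies hash_func to (class name, *field values).

    Hypothetical helper for illustration only; this is not how attrs is
    implemented and not an API it offers.
    """
    def __hash__(self):
        values = tuple(getattr(self, name) for name in field_names)
        return hash_func((type(self).__qualname__, *values))

    return __hash__


class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

    def __eq__(self, other):
        return isinstance(other, Point) and (self.x, self.y) == (other.x, other.y)

    # Swap the default built-in hash() for the MD5-based callable.
    __hash__ = make_hash_method(("x", "y"), hash_func=md5_of_tuple)


print(hash(Point(1, 2)) == hash(Point(1, 2)))
```

With the default `hash_func=hash` the result would still vary between sessions whenever a field or the class name is a string; only a deterministic callable such as `md5_of_tuple` removes that dependency.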

hynek • Mar 11 '22 07:03

tbh, i think using `hash()` for anything other than what python itself uses it for (dictionaries, sets, etc.) is a terrible idea, and attrs shouldn't facilitate hooking into what i consider internal puzzle pieces used for hash generation, beyond what's currently possible:

  • whether to make a class hashable at all (generated __hash__())
  • if so, which attributes to consider (attr.field(hash=...) arg)

what problem are we trying to solve here? the original description mentions a ‘merkle tree’, but these are typically used in security-sensitive domains. i'd think a cryptographic hash function over a stable serialization, which cattrs + `json.dumps(sort_keys=True)` can provide for example, is a much better approach here.
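A minimal sketch of that approach, using only the standard library. The function name `stable_digest` and the sample records are assumptions; the plain dicts stand in for what `cattrs.unstructure(instance)` would produce, and SHA-256 is one possible choice of cryptographic hash.

```python
import hashlib
import json


def stable_digest(unstructured) -> str:
    """SHA-256 hex digest of a canonical JSON encoding of plain data."""
    payload = json.dumps(
        unstructured,
        sort_keys=True,         # key order must not affect the digest
        separators=(",", ":"),  # fixed separators, no incidental whitespace
        ensure_ascii=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Stand-ins for the output of cattrs.unstructure(instance).
record = {"name": "job-1", "inputs": [1, 2, 3], "params": {"b": 2, "a": 1}}
same_record = {"params": {"a": 1, "b": 2}, "inputs": [1, 2, 3], "name": "job-1"}

print(stable_digest(record) == stable_digest(same_record))  # True
```

Since the digest depends only on the serialized bytes, separate processes compute the same value for equal data without touching `__hash__` at all.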

wbolster • Mar 21 '22 16:03