Simplify hash and equality operations
Also broaden the test coverage with set operation and equality to different object.
Follow-up on https://github.com/lmmentel/mendeleev/pull/43
The (macos-latest, 3.6) integration test failure looks like an unrelated internet access failure.
Thanks for the help @kalvdans! I checked your version of __eq__ and __hash__ and I wonder what is your motivation for refactoring it that way. In other words, would you mind explaining why you think your solution should be used instead of the current one?
Also please take into account that python's hash returns the value it get if it's of type integer, e.g.
>>> hash(1)
1
>>> hash(100)
100
>>>
It only computes the actual hash for other types of values:
>>> hash("a")
-2483041677650604276
In your case the calling hash(self.atomic_number) seems redundant.
A hash algorithm is used in hash tables to allow for constant time average lookup time. It should not be used for anything else.
The old implementation of __eq__ will have a slight risk of collission. Since it hashes strings, it gets the random seed mixed in. On 32-bit platsforms the hash is 32 bit. Let's say the hash is perfectly random. The probability of two elements out of the 118 existing ones having the same hash is then 1.62e-6 according to the birthday paradox. So once in every millionth Python invocation, two different elements will compare equal.
On 64-bit architectures, the risk is smaller, but it is just fundamentally wrong to use hash for equality.
In your case the calling hash(self.atomic_number) seems redundant.
You are right. hash() automatically truncates integers to machine precision. I will change that, hang on!
On 64-bit architectures, the risk is smaller, but it is just fundamentally wrong to use hash for equality.
I'm not sure that I agree since python dict and set operations are based on hashing, so if it's good enough for the standard library it's good enough for our use case.
It seems that the solution you are presenting is a weaker equality condition that the one based on hashing since it uses a single attribute for comparison instead of 69 attributes. I don't think that I would like to rely on a single attribute since the odds of someone accidentally assigning a new value to Element.atomic_number or any other attribute may be higher that probability of a hash collision (which is still pretty unlikely). In the case of hash based implementation it would be able to detect that and say the elements are not equal.
I'm not sure that I agree since python
dictandsetoperations are based on hashing, so if it's good enough for the standard library it's good enough for our use case.
They use the hash for inequality only, for equality they still call the __eq__ method, even if two hashes are the same.
someone accidentally assigning
Yes the documentation for __hash__ is clear that the object needs to be immutable. It is a bit tricky to enforce immutability on homemade Python objects, but that could be a future endeavour. Hashing all elements of a mutable object will still be buggy in case the object is already inside a dict.
I cannot see a strong case to get this merged, would like to continue iterating on this?
I care a lot about the quality of the mendeleev package, please tag someone else on the PR to get a third opinion.
I care a lot about the quality of the mendeleev package, please tag someone else on the PR to get a third opinion.
Thanks for your concern. I cannot currently see enough reason to devote more attention to this issue. If you find a concrete example (as in https://stackoverflow.com/help/minimal-reproducible-example) that illustrates any vulnerabilities with the current implementation I would be happy to jump in and help address that.
@mattwthompson, can you have a look at this PR?