mendeleev Simplify hash and equality operations

Also broaden the test coverage with set operation and equality to different object.

Follow-up on https://github.com/lmmentel/mendeleev/pull/43

Dec 08 '21 13:12 kalvdans

The (macos-latest, 3.6) integration test failure looks like an unrelated internet access failure.

Dec 08 '21 13:12 kalvdans

Thanks for the help @kalvdans! I checked your version of __eq__ and __hash__ and I wonder what is your motivation for refactoring it that way. In other words, would you mind explaining why you think your solution should be used instead of the current one?

Also please take into account that python's hash returns the value it get if it's of type integer, e.g.

>>> hash(1)
1
>>> hash(100)
100
>>>

It only computes the actual hash for other types of values:

>>> hash("a")
-2483041677650604276

In your case the calling hash(self.atomic_number) seems redundant.

Dec 08 '21 22:12 lmmentel

A hash algorithm is used in hash tables to allow for constant time average lookup time. It should not be used for anything else.

The old implementation of __eq__ will have a slight risk of collission. Since it hashes strings, it gets the random seed mixed in. On 32-bit platsforms the hash is 32 bit. Let's say the hash is perfectly random. The probability of two elements out of the 118 existing ones having the same hash is then 1.62e-6 according to the birthday paradox. So once in every millionth Python invocation, two different elements will compare equal.

On 64-bit architectures, the risk is smaller, but it is just fundamentally wrong to use hash for equality.

Dec 09 '21 05:12 kalvdans

In your case the calling hash(self.atomic_number) seems redundant.

You are right. hash() automatically truncates integers to machine precision. I will change that, hang on!

Dec 09 '21 05:12 kalvdans

On 64-bit architectures, the risk is smaller, but it is just fundamentally wrong to use hash for equality.

I'm not sure that I agree since python dict and set operations are based on hashing, so if it's good enough for the standard library it's good enough for our use case.

It seems that the solution you are presenting is a weaker equality condition that the one based on hashing since it uses a single attribute for comparison instead of 69 attributes. I don't think that I would like to rely on a single attribute since the odds of someone accidentally assigning a new value to Element.atomic_number or any other attribute may be higher that probability of a hash collision (which is still pretty unlikely). In the case of hash based implementation it would be able to detect that and say the elements are not equal.

Dec 10 '21 08:12 lmmentel

I'm not sure that I agree since python dict and set operations are based on hashing, so if it's good enough for the standard library it's good enough for our use case.

They use the hash for inequality only, for equality they still call the __eq__ method, even if two hashes are the same.

someone accidentally assigning

Yes the documentation for __hash__ is clear that the object needs to be immutable. It is a bit tricky to enforce immutability on homemade Python objects, but that could be a future endeavour. Hashing all elements of a mutable object will still be buggy in case the object is already inside a dict.

Dec 10 '21 09:12 kalvdans

I cannot see a strong case to get this merged, would like to continue iterating on this?

Dec 12 '21 20:12 lmmentel

I care a lot about the quality of the mendeleev package, please tag someone else on the PR to get a third opinion.

Dec 13 '21 07:12 kalvdans

I care a lot about the quality of the mendeleev package, please tag someone else on the PR to get a third opinion.

Thanks for your concern. I cannot currently see enough reason to devote more attention to this issue. If you find a concrete example (as in https://stackoverflow.com/help/minimal-reproducible-example) that illustrates any vulnerabilities with the current implementation I would be happy to jump in and help address that.

Dec 19 '21 20:12 lmmentel

@mattwthompson, can you have a look at this PR?

Dec 20 '21 07:12 kalvdans