pymatgen icon indicating copy to clipboard operation
pymatgen copied to clipboard

change equality comparison operation to `as_dict`

Open orionarcher opened this issue 2 years ago • 19 comments

Summary

The current equality operation for InputSet's will fail if InputSet.inputs has mutable values. This happens often because InputSet is designed to support InputFiles as inputs. I have changed the equality operation to use as_dict which will coerce all nested elements to immutable types, enabling easier comparison. Would love a final look from @rkingsbury to make sure this makes sense.

This follows up on a previous issue I raised where the equality operation was failing entirely.

orionarcher avatar Sep 23 '22 00:09 orionarcher

Coverage Status

Coverage increased (+0.002%) to 63.347% when pulling 7b350d1ed26eee7da2e4f09bc35185c14ed8d197 on orionarcher:fix_input_set_equality into 6c23d744efbd892ec48346297d61b4f3f86b1478 on materialsproject:master.

coveralls avatar Sep 23 '22 00:09 coveralls

Summary

The current equality operation for InputSet's will fail if InputSet.inputs has mutable values. This happens often because InputSet is designed to support InputFiles as inputs. I have changed the equality operation to use as_dict which will coerce all nested elements to immutable types, enabling easier comparison. Would love a final look from @rkingsbury to make sure this makes sense.

This follows up on a previous issue I raised where the equality operation was failing entirely.

Thanks for identifying @orionarcher . This looks find to me and appears innocuous as far as compatibility goes. Could you add a test that compares InputSet with mutable types?

For others reading this - see #2575 for previous work on this issue.

rkingsbury avatar Sep 23 '22 16:09 rkingsbury

I don't think InputSets are meant to be compared?

shyuep avatar Sep 23 '22 16:09 shyuep

I don't think InputSets are meant to be compared?

It may be somewhat unusual , but since InputSet are dict-like mappings it makes sense to me that it should be possible to compare them.

rkingsbury avatar Sep 23 '22 16:09 rkingsbury

@rkingsbury No no.... the question is whether they are meant to be compared. The default answer for most objects should be no. In the case of input sets, it is an algorithm to generate a set of input files for a given structure. You can compare structures. But something like MPStaticSet == MPRelaxSet makes zero sense.

shyuep avatar Sep 23 '22 16:09 shyuep

@rkingsbury No no.... the question is whether they are meant to be compared. The default answer for most objects should be no. In the case of input sets, it is an algorithm to generate a set of input files for a given structure. You can compare structures. But something like MPStaticSet == MPRelaxSet makes zero sense.

I agree that MPStaticSet == MPRelaxSet makes no sense. But in the new IO paradigm, remember that MPStaticSet and MPRelaxSet would be input generators which lack structure information. What Orion is referring to is comparing a specific InputSet created by the generator, for a specific structure. You might want to do this if you have a large number of calculations prepared, and you need to deduplicate them. If all the settings are the same (i.e., all the InputSet attributes), and the structure is the same, then the two InputSet define identical calculations, and it makes sense to consider them "equal".

rkingsbury avatar Sep 23 '22 16:09 rkingsbury

I guess I really don't understand the use case. If InputSet contains structure, then the structure comparison will tell you whether it is equal. If the InputSet is merely a generator that does not contain structure, then you should initialize only one InputSet for each distinct type of calculation you do. I don't see a scenario where I initialize multiple InputSets and end up having to deduplicate them. Deduplication should take place at the level of the structure, not at the InputSet level.

shyuep avatar Sep 23 '22 17:09 shyuep

I guess I really don't understand the use case. If InputSet contains structure, then the structure comparison will tell you whether it is equal. If the InputSet is merely a generator that does not contain structure, then you should initialize only one InputSet for each distinct type of calculation you do. I don't see a scenario where I initialize multiple InputSets and end up having to deduplicate them. Deduplication should take place at the level of the structure, not at the InputSet level.

I hear what you're saying. I think there is confusion here because of terminology - when we developed the Abstract IO interface we decided to retain the name InputSet even though its meaning changed slightly compared to the legacy approach. The potential for confusion is compounded by the fact that the VASP InputSet in pymatgen still use the legacy paradigm. (Much progress has been made porting VASP workflows to the new paradigm over in atomate2, and I think that work will be moved to pymatgen soon.)

Let me try to clarify. In the new paradigm, InputGenerator do exactly what you describe here:

you should initialize only one InputSet for each distinct type of calculation you do

You initialize one InputGenerator e.g. a MPRelaxSetGen() for each calculation type. This contains all settings but no structure. Then you call InputGenerator().get_input_set(structure) to create an InputSet that is a concrete calculation (settings + structure). This object is similar to VaspInput. That is the object that we want to compare.

Deduplication should take place at the level of the structure, not at the InputSet level.

I would say you want to deduplicate on the level of structure+settings, hence the value of having InputSet equality.

rkingsbury avatar Sep 23 '22 17:09 rkingsbury

I disagree on the deduplication strategy. Let me phrase it differently. Say you have a MPStaticGenerator and a MPRelaxGenerator (using the old terminology for convenience here). Why would I want to bother checking the settings as well as the structure? I would deduplicate the structures first. That guarantees that the inputs generated would be unique. You will need to provide me with a scenario where equal structures can have equal input settings for different generators, or equal structures can have different input settings for the same generator.

shyuep avatar Sep 23 '22 17:09 shyuep

Say you have a MPStaticGenerator and a MPRelaxGenerator (using the old terminology for convenience here). Why would I want to bother checking the settings as well as the structure?

Provided that all InputSet have been created by the same InputGenerator, then you would not need to bother checking settings. However, one of the goals in the new IO paradigm was to standardize the behavior of InputSet, regardless of how it was generated. That means I could have a collection of InputSet that are created from different generators.

Consider this pseudo-code example:

s = Structure(...)
MPRelaxGenerator(<relaxation settings>).get_input_set(s) -> InputSet1

MPStaticGenerator(<static settings>).get_input_set(s) -> InputSet2

InputSet1 and InputSet2 behave the same way by design - they have the same class methods and attributes, etc. But one defines a static calc and the other a relax. If we only compared based on the structure, we would conclude InputSet1 == InputSet2. But that's not the case. They define different sets of calculation inputs.

rkingsbury avatar Sep 23 '22 17:09 rkingsbury

Yes, what I mean is when would you be in no control of InputGen1 InputGen2 such that you have to ex-post check for equality of InputSets? My conceptualization is that you will decide you only want to use InputGen1 and InputGen2, and you supply unique structures to Gen1 and Gen2. By definition, the four InputSets will all be unique. There is no reason to check it at the end. In fact, checking at the end of InputSet guarantees that instead of checking only n structures for equality, you will have to check n*k InputSets. Given that checking is O(N^2), that scales extremely poorly.

If you insist on checking InputSets rather than ensuring uniqueness, then I would suggest you simply have a metadata that says this InputSet was created using InputGen1. In that case, you can do checks in two separate n and k checks rather than one giant n*k check.

shyuep avatar Sep 23 '22 19:09 shyuep

Yes, what I mean is when would you be in no control of InputGen1 InputGen2 such that you have to ex-post check for equality of InputSets? My conceptualization is that you will decide you only want to use InputGen1 and InputGen2, and you supply unique structures to Gen1 and Gen2. By definition, the four InputSets will all be unique. There is no reason to check it at the end. In fact, checking at the end of InputSet guarantees that instead of checking only n structures for equality, you will have to check n*k InputSets. Given that checking is O(N^2), that scales extremely poorly.

If you insist on checking InputSets rather than ensuring uniqueness, then I would suggest you simply have a metadata that says this InputSet was created using InputGen1. In that case, you can do checks in two separate n and k checks rather than one giant n*k check.

Fair enough. Regardless of whether this is an efficient workflow or not though, I don't see any harm in having an equality operation in place for InputSet. It inherits from MutableMapping, behaves similar to a dictionary, and therefore I think it's reasonable that users might expect to be able test equality.

an equality check is also useful for unit testing, where you might want to do

is1 = SomeInputSet.from_directory(<test_dir>)
is2 = SomeInputGenerator().get_input_set(s)
assert is1==is2

I think this is how @orionarcher discovered the problem documented here and in the linked issue.

rkingsbury avatar Sep 23 '22 20:09 rkingsbury

I'd like to offer another perspective to the one above. I've never used VASP and I have never used the Pymatgen Structure. I have been using the new pymatgen IO while developing the pymatgen.io.openmm add-on package. There, as @rkingsbury said, I test equality in the unit tests to make sure that writing objects to and from MSON does not alter them. Since I am mixing together code from the openmm and pymatgen pacakges, this was not straightforward.

Testing equality has already been useful for me and I think it's worth keeping in mind use cases beyond our immediate development workflows, especially in light of the new pymatgen.addon infrastructure.

orionarcher avatar Sep 23 '22 20:09 orionarcher

I think we need to distinguish between unittest checking - an esoteric thing that just checks if something works as intended, vs an actual functionality that is encouraged for users. E.g., when we do unittest checking, we can do really blunt stuff like comparing strings of input files. Terrible for efficiency and very prone to numerical issues, but nonetheless, a good enough thing for a lot of purposes.

Re the openmm, I am somewhat confused by the statement that pymatgen Structure is not used. I would think the whole idea of pymatgen.io.xxx is to go from a pymatgen Molecule/Structure to a set of input files?

shyuep avatar Sep 23 '22 20:09 shyuep

Re the openmm, I am somewhat confused by the statement that pymatgen Structure is not used. I would think the whole idea of pymatgen.io.xxx is to go from a pymatgen Molecule/Structure to a set of input files?

It uses Molecule to generate a set of input files

I feel our discussion is getting off track. As you often remind me @shyuep , we can't put so many guardrails in pymatgen that we guarantee users won't do silly things. If someone wants to put a single MPRester query in a for loop and run it a million times (and thrash our API), we can't prevent that. InputSet equality checking may not always be the most efficient way to accomplish a task, but it is clearly useful (as demonstrated for unit testing and as evidenced by the fact that @orionarcher , who has developed a very nice pymatgen addon package, needs it for further work). Moreover, we already have equality checking in place, this PR is just a bugfix.

I'd like to refocus on our choices here, which are 1) to move forward with the bugfix or 2) to remove InputSet equality checking altogether, which will put a significant amount of extra work on @orionarcher to create good quality tests for the new OpenMM IO. Moreover as you point out @shyuep , doing so might require "blunt stuff like comparing strings of input files." that are "Terrible for efficiency and very prone to numerical issues". Resorting to approaches like that inherently reduces the quality of the tests in my opinion, because it's much easier to make a subtle mistake.

In the particular case of openmm, the input files are .xml based due to the way the code is constructed, so the effort to get tests working would be even more involved.

Needless to say, I favor option 1. Do you see any technical problems with using as_dict as the basis for the equality check?

rkingsbury avatar Sep 23 '22 21:09 rkingsbury

I'm not sure I'd agree that checking if something works as intended is esoteric, but I do see your point RE efficiency. In this case, perhaps prioritizing developer time over computational cost might be the right move, as the cost of IO will be dwarfed by the cost of simulations. In short, I'd support option 1.

In classical MD simulations we need to specify bonding ahead of time, so Molecule and Structure are not very convenient representations. The IO heavily uses the new pymatgen IO paradigm and Molecules do enter the code in several places.

orionarcher avatar Sep 23 '22 21:09 orionarcher

I favor not checking InputSets for equality. In the end, what you are saying is that instead of:

self.assertEqual(inputset1, inputset2)

the developer has to write

self.assertEqual(inputset1.as_dict(), inputset2.as_dict())

The latter doesn't seem like a lot of effort to me and is in fact used in many places for tests (other than strings).

My view is that the commitment to an __eq__ implementation is highly significant and requires rigorous definition. That is why Python objects by default do not implement __eq__ and only checks for equality of ids. Let me present a scenario - let's say someone supplies methane to an input set and say that person is not particularly careful and supplies a second methane that is rotated by 15 degrees to the first. For all intents and purposes, the two InputSets are equal for most computational codes - Gaussian, VASP, whatever you can think of. But using as_dict would result in the two InputSets as unequal.

The collorary I can give is Structure.__eq__ vs Structure.matches. Here, we have a definition that __eq__ is meant to compare based on numerical closeness while matches (a far more expensive method) actually handles site mappings. The latter is the "correct" way of doing everything. There are almost no instances where structure1 == structure2 is the correct method to call by a user. But there is an important use case in that we need Structure to be usable as a key in a dictionary that is being updated and that's where the __eq__ comes in. That "important to be usable as a key in a dict" is just about the only situation where I think it is useful to have a non-rigorous __eq__ definition. This applies to things like ComputedEntry as well.

Certainly I agree we do not need to put too many guardrails on users. But here, we are actively implementing a feature that encourages bad use with very limited good use case (unittest is certainly one good use case, but like I pointed out, it is not a user use-case and it can be dealt with using minimal additional code). In essence, we are giving users a way to do an O((Nk)^2) comparison instead of an O(N^2) comparison with zero difference in outcomes. I also don't see InputSets being used as keys in a dict. In fact, usage as a key would violate the general concept that "sets" are usually non-hashable.

shyuep avatar Sep 23 '22 21:09 shyuep

My view is that the commitment to an __eq__ implementation is highly significant and requires rigorous definition. That is why Python objects by default do not implement __eq__ and only checks for equality of ids.

I see; I didn't realize that there was a more general architectural decision to not implement __eq__ methods. You make a good point that we can just use InputSet1.as_dict() == InputSet2.as_dict() in tests. That's an equally good technical solution that doesn't impose extra overhead on developers.

Maybe the best course of action then is to remove the __eq__ method. @orionarcher , if you agree, would you mind updating your PR accordingly?

rkingsbury avatar Sep 23 '22 22:09 rkingsbury

I've opened a PR to remove eq so I think we can close this one. Hopefully this will not cause too much disruption to ongoing development of inherited classes.

rkingsbury avatar Sep 29 '22 18:09 rkingsbury

I'll close this as completed by #2671.

janosh avatar May 08 '23 19:05 janosh