pymatgen
pymatgen copied to clipboard
change equality comparison operation to `as_dict`
Summary
The current equality operation for InputSet's will fail if InputSet.inputs
has mutable values. This happens often because InputSet is designed to support InputFile
s as inputs
. I have changed the equality operation to use as_dict
which will coerce all nested elements to immutable types, enabling easier comparison. Would love a final look from @rkingsbury to make sure this makes sense.
This follows up on a previous issue I raised where the equality operation was failing entirely.
Coverage increased (+0.002%) to 63.347% when pulling 7b350d1ed26eee7da2e4f09bc35185c14ed8d197 on orionarcher:fix_input_set_equality into 6c23d744efbd892ec48346297d61b4f3f86b1478 on materialsproject:master.
Summary
The current equality operation for InputSet's will fail if
InputSet.inputs
has mutable values. This happens often because InputSet is designed to supportInputFile
s asinputs
. I have changed the equality operation to useas_dict
which will coerce all nested elements to immutable types, enabling easier comparison. Would love a final look from @rkingsbury to make sure this makes sense.This follows up on a previous issue I raised where the equality operation was failing entirely.
Thanks for identifying @orionarcher . This looks find to me and appears innocuous as far as compatibility goes. Could you add a test that compares InputSet
with mutable types?
For others reading this - see #2575 for previous work on this issue.
I don't think InputSets are meant to be compared?
I don't think InputSets are meant to be compared?
It may be somewhat unusual , but since InputSet
are dict-like mappings it makes sense to me that it should be possible to compare them.
@rkingsbury No no.... the question is whether they are meant to be compared. The default answer for most objects should be no. In the case of input sets, it is an algorithm to generate a set of input files for a given structure. You can compare structures. But something like MPStaticSet == MPRelaxSet makes zero sense.
@rkingsbury No no.... the question is whether they are meant to be compared. The default answer for most objects should be no. In the case of input sets, it is an algorithm to generate a set of input files for a given structure. You can compare structures. But something like MPStaticSet == MPRelaxSet makes zero sense.
I agree that MPStaticSet == MPRelaxSet makes no sense. But in the new IO paradigm, remember that MPStaticSet and MPRelaxSet would be input generators which lack structure information. What Orion is referring to is comparing a specific InputSet
created by the generator, for a specific structure. You might want to do this if you have a large number of calculations prepared, and you need to deduplicate them. If all the settings are the same (i.e., all the InputSet
attributes), and the structure is the same, then the two InputSet
define identical calculations, and it makes sense to consider them "equal".
I guess I really don't understand the use case. If InputSet contains structure, then the structure comparison will tell you whether it is equal. If the InputSet is merely a generator that does not contain structure, then you should initialize only one InputSet for each distinct type of calculation you do. I don't see a scenario where I initialize multiple InputSets and end up having to deduplicate them. Deduplication should take place at the level of the structure, not at the InputSet level.
I guess I really don't understand the use case. If InputSet contains structure, then the structure comparison will tell you whether it is equal. If the InputSet is merely a generator that does not contain structure, then you should initialize only one InputSet for each distinct type of calculation you do. I don't see a scenario where I initialize multiple InputSets and end up having to deduplicate them. Deduplication should take place at the level of the structure, not at the InputSet level.
I hear what you're saying. I think there is confusion here because of terminology - when we developed the Abstract IO interface we decided to retain the name InputSet
even though its meaning changed slightly compared to the legacy approach. The potential for confusion is compounded by the fact that the VASP InputSet
in pymatgen still use the legacy paradigm. (Much progress has been made porting VASP workflows to the new paradigm over in atomate2
, and I think that work will be moved to pymatgen soon.)
Let me try to clarify. In the new paradigm, InputGenerator
do exactly what you describe here:
you should initialize only one InputSet for each distinct type of calculation you do
You initialize one InputGenerator
e.g. a MPRelaxSetGen()
for each calculation type. This contains all settings but no structure. Then you call InputGenerator().get_input_set(structure)
to create an InputSet
that is a concrete calculation (settings + structure). This object is similar to VaspInput
. That is the object that we want to compare.
Deduplication should take place at the level of the structure, not at the InputSet level.
I would say you want to deduplicate on the level of structure+settings, hence the value of having InputSet
equality.
I disagree on the deduplication strategy. Let me phrase it differently. Say you have a MPStaticGenerator and a MPRelaxGenerator (using the old terminology for convenience here). Why would I want to bother checking the settings as well as the structure? I would deduplicate the structures first. That guarantees that the inputs generated would be unique. You will need to provide me with a scenario where equal structures can have equal input settings for different generators, or equal structures can have different input settings for the same generator.
Say you have a MPStaticGenerator and a MPRelaxGenerator (using the old terminology for convenience here). Why would I want to bother checking the settings as well as the structure?
Provided that all InputSet
have been created by the same InputGenerator
, then you would not need to bother checking settings. However, one of the goals in the new IO paradigm was to standardize the behavior of InputSet
, regardless of how it was generated. That means I could have a collection of InputSet
that are created from different generators.
Consider this pseudo-code example:
s = Structure(...)
MPRelaxGenerator(<relaxation settings>).get_input_set(s) -> InputSet1
MPStaticGenerator(<static settings>).get_input_set(s) -> InputSet2
InputSet1
and InputSet2
behave the same way by design - they have the same class methods and attributes, etc. But one defines a static calc and the other a relax. If we only compared based on the structure, we would conclude InputSet1 == InputSet2
. But that's not the case. They define different sets of calculation inputs.
Yes, what I mean is when would you be in no control of InputGen1 InputGen2 such that you have to ex-post check for equality of InputSets? My conceptualization is that you will decide you only want to use InputGen1 and InputGen2, and you supply unique structures to Gen1 and Gen2. By definition, the four InputSets will all be unique. There is no reason to check it at the end. In fact, checking at the end of InputSet guarantees that instead of checking only n structures for equality, you will have to check n*k InputSets. Given that checking is O(N^2), that scales extremely poorly.
If you insist on checking InputSets rather than ensuring uniqueness, then I would suggest you simply have a metadata that says this InputSet was created using InputGen1. In that case, you can do checks in two separate n and k checks rather than one giant n*k check.
Yes, what I mean is when would you be in no control of InputGen1 InputGen2 such that you have to ex-post check for equality of InputSets? My conceptualization is that you will decide you only want to use InputGen1 and InputGen2, and you supply unique structures to Gen1 and Gen2. By definition, the four InputSets will all be unique. There is no reason to check it at the end. In fact, checking at the end of InputSet guarantees that instead of checking only n structures for equality, you will have to check n*k InputSets. Given that checking is O(N^2), that scales extremely poorly.
If you insist on checking InputSets rather than ensuring uniqueness, then I would suggest you simply have a metadata that says this InputSet was created using InputGen1. In that case, you can do checks in two separate n and k checks rather than one giant n*k check.
Fair enough. Regardless of whether this is an efficient workflow or not though, I don't see any harm in having an equality operation in place for InputSet
. It inherits from MutableMapping
, behaves similar to a dictionary, and therefore I think it's reasonable that users might expect to be able test equality.
an equality check is also useful for unit testing, where you might want to do
is1 = SomeInputSet.from_directory(<test_dir>)
is2 = SomeInputGenerator().get_input_set(s)
assert is1==is2
I think this is how @orionarcher discovered the problem documented here and in the linked issue.
I'd like to offer another perspective to the one above. I've never used VASP and I have never used the Pymatgen Structure
. I have been using the new pymatgen IO while developing the pymatgen.io.openmm
add-on package. There, as @rkingsbury said, I test equality in the unit tests to make sure that writing objects to and from MSON does not alter them. Since I am mixing together code from the openmm
and pymatgen
pacakges, this was not straightforward.
Testing equality has already been useful for me and I think it's worth keeping in mind use cases beyond our immediate development workflows, especially in light of the new pymatgen.addon
infrastructure.
I think we need to distinguish between unittest checking - an esoteric thing that just checks if something works as intended, vs an actual functionality that is encouraged for users. E.g., when we do unittest checking, we can do really blunt stuff like comparing strings of input files. Terrible for efficiency and very prone to numerical issues, but nonetheless, a good enough thing for a lot of purposes.
Re the openmm, I am somewhat confused by the statement that pymatgen Structure is not used. I would think the whole idea of pymatgen.io.xxx is to go from a pymatgen Molecule/Structure to a set of input files?
Re the openmm, I am somewhat confused by the statement that pymatgen Structure is not used. I would think the whole idea of pymatgen.io.xxx is to go from a pymatgen Molecule/Structure to a set of input files?
It uses Molecule
to generate a set of input files
I feel our discussion is getting off track. As you often remind me @shyuep , we can't put so many guardrails in pymatgen that we guarantee users won't do silly things. If someone wants to put a single MPRester query in a for
loop and run it a million times (and thrash our API), we can't prevent that. InputSet
equality checking may not always be the most efficient way to accomplish a task, but it is clearly useful (as demonstrated for unit testing and as evidenced by the fact that @orionarcher , who has developed a very nice pymatgen addon package, needs it for further work). Moreover, we already have equality checking in place, this PR is just a bugfix.
I'd like to refocus on our choices here, which are 1) to move forward with the bugfix or 2) to remove InputSet
equality checking altogether, which will put a significant amount of extra work on @orionarcher to create good quality tests for the new OpenMM IO. Moreover as you point out @shyuep , doing so might require "blunt stuff like comparing strings of input files." that are "Terrible for efficiency and very prone to numerical issues". Resorting to approaches like that inherently reduces the quality of the tests in my opinion, because it's much easier to make a subtle mistake.
In the particular case of openmm, the input files are .xml based due to the way the code is constructed, so the effort to get tests working would be even more involved.
Needless to say, I favor option 1. Do you see any technical problems with using as_dict
as the basis for the equality check?
I'm not sure I'd agree that checking if something works as intended is esoteric, but I do see your point RE efficiency. In this case, perhaps prioritizing developer time over computational cost might be the right move, as the cost of IO will be dwarfed by the cost of simulations. In short, I'd support option 1.
In classical MD simulations we need to specify bonding ahead of time, so Molecule and Structure are not very convenient representations. The IO heavily uses the new pymatgen
IO paradigm and Molecule
s do enter the code in several places.
I favor not checking InputSets for equality. In the end, what you are saying is that instead of:
self.assertEqual(inputset1, inputset2)
the developer has to write
self.assertEqual(inputset1.as_dict(), inputset2.as_dict())
The latter doesn't seem like a lot of effort to me and is in fact used in many places for tests (other than strings).
My view is that the commitment to an __eq__
implementation is highly significant and requires rigorous definition. That is why Python objects by default do not implement __eq__
and only checks for equality of ids. Let me present a scenario - let's say someone supplies methane to an input set and say that person is not particularly careful and supplies a second methane that is rotated by 15 degrees to the first. For all intents and purposes, the two InputSets are equal for most computational codes - Gaussian, VASP, whatever you can think of. But using as_dict would result in the two InputSets as unequal.
The collorary I can give is Structure.__eq__
vs Structure.matches
. Here, we have a definition that __eq__
is meant to compare based on numerical closeness while matches
(a far more expensive method) actually handles site mappings. The latter is the "correct" way of doing everything. There are almost no instances where structure1 == structure2 is the correct method to call by a user. But there is an important use case in that we need Structure to be usable as a key in a dictionary that is being updated and that's where the __eq__
comes in. That "important to be usable as a key in a dict" is just about the only situation where I think it is useful to have a non-rigorous __eq__
definition. This applies to things like ComputedEntry as well.
Certainly I agree we do not need to put too many guardrails on users. But here, we are actively implementing a feature that encourages bad use with very limited good use case (unittest is certainly one good use case, but like I pointed out, it is not a user
use-case and it can be dealt with using minimal additional code). In essence, we are giving users a way to do an O((Nk)^2) comparison instead of an O(N^2) comparison with zero difference in outcomes. I also don't see InputSets being used as keys in a dict. In fact, usage as a key would violate the general concept that "sets" are usually non-hashable.
My view is that the commitment to an
__eq__
implementation is highly significant and requires rigorous definition. That is why Python objects by default do not implement__eq__
and only checks for equality of ids.
I see; I didn't realize that there was a more general architectural decision to not implement __eq__
methods. You make a good point that we can just use InputSet1.as_dict() == InputSet2.as_dict()
in tests. That's an equally good technical solution that doesn't impose extra overhead on developers.
Maybe the best course of action then is to remove the __eq__
method. @orionarcher , if you agree, would you mind updating your PR accordingly?
I've opened a PR to remove eq so I think we can close this one. Hopefully this will not cause too much disruption to ongoing development of inherited classes.
I'll close this as completed by #2671.