openff-toolkit
Molecule attributes in multiprocessing lost
Dear openff team,
we are really happy with the openff-toolkit and successfully use it in our KinoML framework. Lately, we used the openff Molecule class in some experiments with multiprocessing (#97). We were surprised to see that custom attributes added to Molecule instances get lost when doing multiprocessing. Some online research pointed us to possible serialization issues of the Molecule class. Here is a simple script to reproduce the behavior:
from multiprocessing import Pool

from openff.toolkit.topology import Molecule


def dummy_function(molecule):
    return molecule


if __name__ == '__main__':
    molecules = []
    for i, smiles in enumerate(["CCC", "CCCC", "COC"]):
        molecule = Molecule.from_smiles(smiles)
        molecule.custom = i
        molecules.append(molecule)
    print("Getting attributes without multiprocessing ...")
    for molecule in molecules:
        print(molecule.custom)
    print("Doing some multiprocessing ...")
    with Pool(processes=2) as pool:
        molecules = pool.map(dummy_function, molecules)
    print("Getting attributes after multiprocessing ...")
    for molecule in molecules:
        print(molecule.custom)
    print("Finished!")
After doing some multiprocessing, the "custom" attribute is lost and we receive an AttributeError.
We actually also had problems in terms of serialization and came up with a strategy to serialize everything that is serializable. Maybe this could be a workaround to be considered (see here).
Thanks for looking into this, David
Thanks for the report; the problem seems to be that the custom attributes are lost during the pickle round trip.
>>> import pickle
>>> from openff.toolkit.topology import Molecule
>>> molecule = Molecule.from_smiles("O")
>>> molecule.custom = 4
>>> pickle.loads(pickle.dumps(molecule))
Molecule with name '' and SMILES '[H]O[H]'
>>> pickle.loads(pickle.dumps(molecule)).custom
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'Molecule' object has no attribute 'custom'
This seems to be a common feature of pickling; we don't have much custom code there. We do have other serialization avenues, but unfortunately they don't keep track of custom attributes at the moment either:
>>> Molecule.from_dict(molecule.to_dict()).custom
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'Molecule' object has no attribute 'custom'
>>> Molecule.from_yaml(molecule.to_yaml()).custom
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'Molecule' object has no attribute 'custom'
If it's a matter of getting a single custom attribute through, some subclassing might do the trick. A more general solution from our end might enable some more custom behavior in the serialization round-trips - we'll have to think about that more!
This seems to be a common feature of pickling; we don't have much custom code there.
Here it seems to come down to Molecule serializing only the allowed attributes in to_dict, as wired up in the __getstate__/__setstate__ overrides.
https://github.com/openforcefield/openff-toolkit/blob/a4c184d09f78ef9d2911631126dc322d61ddae28/openff/toolkit/topology/molecule.py#L2638-L2642
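To illustrate the mechanism without depending on the toolkit, here is a minimal sketch. ToyMolecule is a hypothetical stand-in, not the actual Molecule implementation; it only assumes that the state hooks go through a to_dict that lists a fixed set of fields.

```python
import pickle


class ToyMolecule:
    """Hypothetical stand-in for a class that whitelists its pickled state."""

    def __init__(self, name):
        self.name = name
        self.properties = {}

    def to_dict(self):
        # Only the attributes listed here are serialized.
        return {"name": self.name, "properties": dict(self.properties)}

    def __getstate__(self):
        # pickle stores whatever this returns instead of self.__dict__.
        return self.to_dict()

    def __setstate__(self, state):
        # Only the whitelisted fields are restored on load.
        self.name = state["name"]
        self.properties = state["properties"]


mol = ToyMolecule("water")
mol.custom = 4                 # ad hoc attribute, not part of to_dict()
mol.properties["my_int"] = 4   # stored inside a whitelisted field

restored = pickle.loads(pickle.dumps(mol))
print(hasattr(restored, "custom"))  # False
print(restored.properties)          # {'my_int': 4}
```

Anything that is not in the dictionary returned by the state hook never reaches the other side, which is the same thing that happens when multiprocessing pickles objects to send them to worker processes.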
If you subclass Molecule and use the more default method of returning and updating self.__dict__, you get your custom attributes back. I assume there's a reason for overriding the get/setstate methods, though, so this might not be the safest way to construct a molecule.
In [1]: import pickle
In [2]: from openff.toolkit.topology import Molecule
In [3]: class PickleMol(Molecule):
   ...:     def __setstate__(self, dct):
   ...:         self.__dict__.update(dct)
   ...:
   ...:     def __getstate__(self):
   ...:         return self.__dict__  # returns the live dict; deepcopy it for safety
   ...:
In [4]: molecule = PickleMol.from_smiles("O")
   ...: molecule.custom = 5
In [5]: pickle.loads(pickle.dumps(molecule)).custom
Out[5]: 5
we are really happy with the openff-toolkit and successfully use it in our KinoML framework
Thanks for the kind words, @schallerdavid 🙂
@lilyminium and @mattwthompson's responses are spot-on, but there may be an even simpler option for you.
If you're looking to store basic types (floats, ints, strings, or lists/dicts thereof), we also expose the Molecule.properties dict for exactly this sort of use case!
import pickle
from openff.toolkit.topology import Molecule
molecule = Molecule.from_smiles("O")
molecule.properties['my_int'] = 4
molecule.properties['my_str'] = '4'
molecule.properties['my_list'] = ['a', 3, ['b', 4]]
molecule.properties['my_dict'] = {1:'a', 'b':2.345}
print(pickle.loads(pickle.dumps(molecule)).properties)
# {'my_int': 4, 'my_str': '4', 'my_list': ['a', 3, ['b', 4]], 'my_dict': {1: 'a', 'b': 2.345}}
Thanks for all the different suggestions!
I just wanted to open this again to see if there was a more general and robust solution that could be adopted.
Retaining attributes on serialization or deepcopy of Molecule or subclasses seems like an important aspect of functionality if we want to encourage widespread use.
Is there a solution we can adopt within the toolkit that would enable this kind of use?
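For context on why deepcopy is mentioned alongside serialization: unless a class defines __deepcopy__, copy.deepcopy goes through the same __getstate__/__setstate__ hooks as pickle. A rough sketch with two hypothetical classes (neither is toolkit code):

```python
import copy


class Whitelisted:
    """Hypothetical class whose state hooks keep only `name`."""

    def __init__(self, name):
        self.name = name

    def __getstate__(self):
        return {"name": self.name}

    def __setstate__(self, state):
        self.name = state["name"]


class Default:
    """Same data, but the default state handling is left alone."""

    def __init__(self, name):
        self.name = name


a = Whitelisted("water")
a.custom = 1
b = Default("water")
b.custom = 1

a_copy = copy.deepcopy(a)
b_copy = copy.deepcopy(b)
print(hasattr(a_copy, "custom"))  # False
print(b_copy.custom)              # 1
```

So a class that leaves the hooks alone keeps ad hoc attributes through both pickle and deepcopy, while one that whitelists its state drops them in both cases.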
This - or at least the reproduction I copied here - is fixed by #1318, although there is a chance that may be reverted if there's evidence it broke anything else. That change is anticipated to land in version 0.11.0. Unfortunately I don't have a timetable for a release.
(openff-ci) [openff-toolkit] git checkout upstream/master
HEAD is now at 97462b88 Bump actions/setup-python from 3 to 4 (#1317)
(openff-ci) [openff-toolkit] python 1136.py
Getting attributes without multiprocessing ...
0
1
2
Doing some multiprocessing ...
Getting attributes after multiprocessing ...
Traceback (most recent call last):
File "/Users/mattthompson/software/openff-toolkit/1136.py", line 28, in <module>
print(molecule.custom)
AttributeError: 'Molecule' object has no attribute 'custom'
(openff-ci) [openff-toolkit] git checkout upstream/topology-biopolymer-refactor
Previous HEAD position was 97462b88 Bump actions/setup-python from 3 to 4 (#1317)
HEAD is now at 8bc28a82 Do not override `__setstate__` or `__getstate__`' (#1318)
(openff-ci) [openff-toolkit] python 1136.py
Getting attributes without multiprocessing ...
0
1
2
Doing some multiprocessing ...
Getting attributes after multiprocessing ...
0
1
2
Finished!
Version 0.11.0 is released and fixes this - I re-ran the reproduction and it completes without error. I think this does ultimately stem from how __setstate__ and __getstate__ overrides accidentally changed pickling behavior.
The best solution for this functionality would be a complete rewrite of core object models to be based on Pydantic, but that's not something I can commit to working on anytime soon. I don't want to stifle any discussion on safe, stable serialization but I would encourage that to be moved to a separate issue from this one, which was narrow in scope (with a clean reproduction - thanks!) and fixed.