openff-toolkit
openff-toolkit copied to clipboard
oemol.GetConfs() consuming large amount of memory even when no conformers are present
Describe the bug
Not a bug per se, but could impact on toolkit usability for large molecules -- while debugging https://github.com/openforcefield/openff-nagl/issues/101 I saw that converting molecules to and from OpenEye consumes a large amount of memory that is not seen with RDKit. For a 5177 atom protein, calling Molecule.from_openeye consumes about 800 MiB. Memray attributes most of this to oeconf.GetCoords, even though no conformers are generated or attached at any point to the molecule. Would it be possible to check for conformers before calling conf.GetCoords? (It may be that this triggers the same memory-consuming process, though!)
To Reproduce
mre.py (also attached):
from openff.toolkit import Molecule
protein = Molecule.from_smiles(
"CC[C@H](C)[C@H](NC(=O)CNC(=O)CNC(=O)[C@H](CCC(N)=O)NC(=O)[C@H](CCC(N)=O)NC(=O)[C@H](CC(=O)[O-])NC(=O)[C@H](CC(=O)[O-])NC(=O)[C@H](CCCNC(N)=[NH2+])NC(=O)CNC(=O)[C@H](CS)NC(=O)[C@@H](NC(=O)[C@H](CCCNC(N)=[NH2+])NC(=O)[C@H](Cc1ccccc1)NC(=O)[C@@H](NC(=O)[C@H](CC(N)=O)NC(=O)[C@H](Cc1c[nH]c2ccccc12)NC(=O)[C@H](CC(C)C)NC(=O)CNC(=O)[C@H](CCCNC(N)=[NH2+])NC(=O)[C@H](CCC(=O)[O-])NC(=O)[C@@H](NC(=O)[C@H](Cc1ccccc1)NC(=O)[C@@H](NC(=O)[C@@H]1CCCN1C(=O)[C@H](CC(N)=O)NC(=O)[C@@H](NC(=O)[C@H](C)NC(=O)[C@H](C)NC(=O)[C@@H]1CCCN1C(=O)[C@@H](NC(=O)[C@@H](NC(=O)[C@H](CCC(=O)[O-])NC(=O)[C@H](CC(C)C)NC(=O)[C@@H](NC(=O)CNC(=O)[C@H](CC(N)=O)NC(=O)[C@H](CCC(=O)[O-])NC(=O)[C@H](C)NC(=O)[C@H](Cc1ccc(O)cc1)NC(=O)[C@@H](NC(=O)[C@H](CCC(=O)[O-])NC(=O)[C@H](CO)NC(=O)[C@H](C)NC(=O)[C@@H]1CCCN1C(=O)[C@@H](NC(=O)[C@H](CO)NC(=O)[C@@H](NC(=O)CNC(=O)[C@H](CO)NC(=O)[C@H](CC(N)=O)NC(=O)[C@H](Cc1ccccc1)NC(=O)[C@H](Cc1cnc[nH]1)NC(=O)CNC(=O)[C@@H](NC(=O)[C@@H](NC(=O)[C@H](Cc1ccccc1)NC(=O)[C@H](CCCC[NH3+])NC(=O)[C@@H](NC(=O)CNC(=O)[C@H](CC(=O)[O-])NC(=O)[C@H](C)NC(=O)[C@@H](NC(=O)[C@H](Cc1ccccc1)NC(=O)[C@H](CCCC[NH3+])NC(=O)[C@H](CC(N)=O)NC(=O)[C@H](C)NC(=O)[C@@H](NC(=O)[C@H](CO)NC(=O)[C@@H](NC(=O)CNC(=O)[C@H](CCC(N)=O)NC(=O)[C@H](CCCC[NH3+])NC(=O)[C@@H]1CCCN1C(=O)[C@H](CC(=O)[O-])NC(=O)[C@H](CO)NC(=O)[C@@H](NC(=O)[C@H](CC(=O)[O-])NC(=O)[C@H](CC(=O)[O-])NC(=O)CNC(=O)[C@H](CC(C)C)NC(=O)[C@@H](NC(=O)[C@@H](NC(=O)[C@H](CCCC[NH3+])NC(=O)[C@@H](NC(=O)[C@H](CCC(N)=O)NC(=O)[C@H](CCC(=O)[O-])NC(=O)CNC(=O)[C@H](CC(N)=O)NC(=O)[C@@H](NC(=O)CNC(=O)CNC(=O)[C@H](C)NC(=O)[C@H](C)NC(=O)[C@H](CC(N)=O)NC(=O)[C@@H](NC(=O)[C@H](CC(=O)[O-])NC(=O)[C@H](CCCC[NH3+])NC(=O)[C@H](C)NC(=O)[C@H](C)NC(=O)[C@H](CCC(N)=O)NC(=O)[C@H](CCC(=O)[O-])NC(=O)[C@H](C)NC(=O)CNC(=O)[C@H](CCCC[NH3+])NC(=O)[C@H](CCC(N)=O)NC(=O)[C@@H](NC(=O)[C@H](CCC(N)=O)NC(=O)[C@H](C)NC(=O)CNC(=O)[C@H](Cc1ccccc1)NC(=O)[C@H](C)NC(=O)[C@H](C)NC(=O)[C@H](CC(N)=O)NC(=O)[C@@H]1CCCN1C(=O)CNC(=O)[C@@H](NC(=O)[C@H](CC(C)C)NC(=O)[C@@H]1CCCN1C(=O)[C@H](C)NC(=O)CNC(=O)[C@@H](NC(=O)[C@H](C)NC(=O)[C@@H](NC(=O)[C@@H](NC(=O)[C@@H](NC(=O)[C@H](CC(=O)[O-])NC(=O)[C@@H]([NH3+])CCSC)C(C)C)C(C)C)[C@@H](C)CC)C(C)C)[C@@H](C)O)[C@@H](C)CC)[C@@H](C)CC)[C@@H](C)CC)[C@@H](C)CC)[C@@H](C)CC)C(C)C)C(C)C)[C@@H](C)CC)C(C)C)C(C)C)C(C)C)C(C)C)C(C)C)C(C)C)[C@@H](C)CC)C(C)C)[C@@H](C)CC)[C@@H](C)CC)[C@@H](C)O)[C@@H](C)O)C(C)C)[C@@H](C)O)[C@@H](C)O)[C@@H](C)O)C(=O)N[C@@H](C)C(=O)NCC(=O)N[C@@H](CCCC[NH3+])C(=O)N[C@@H](Cc1ccc(O)cc1)C(=O)N[C@@H](CC(C)C)C(=O)N[C@@H](C)C(=O)N[C@@H](CC(=O)[O-])C(=O)N[C@@H](Cc1c[nH]cn1)C(=O)N[C@@H](Cc1ccccc1)C(=O)N[C@@H](CCCC[NH3+])C(=O)N[C@@H](CC(=O)[O-])C(=O)N[C@@H](C)C(=O)N[C@@H](CCCC[NH3+])C(=O)N[C@H](C(=O)N[C@@H](C)C(=O)N[C@H](C(=O)N[C@H](C(=O)N[C@@H](Cc1cnc[nH]1)C(=O)N[C@@H](CC(=O)[O-])C(=O)N[C@@H](CCCC[NH3+])C(=O)N[C@H](C(=O)N1CCC[C@H]1C(=O)N[C@@H](Cc1ccc(O)cc1)C(=O)NCC(=O)N[C@@H](CCC(N)=O)C(=O)NCC(=O)N[C@@H](CC(C)C)C(=O)N[C@@H](C)C(=O)N[C@@H](CC(=O)[O-])C(=O)N[C@@H](CCC(=O)[O-])C(=O)N[C@H](C(=O)N[C@@H](CCCC[NH3+])C(=O)N[C@@H](CCCC[NH3+])C(=O)N[C@@H](C)C(=O)N[C@@H](C)C(=O)N[C@@H](CC(N)=O)C(=O)N[C@@H](C)C(=O)N[C@@H](C)C(=O)NCC(=O)N[C@H](C(=O)N[C@H](C(=O)N[C@@H](CCC(=O)[O-])C(=O)N[C@H](C(=O)N[C@@H](CCSC)C(=O)N[C@@H](Cc1ccc(O)cc1)C(=O)N[C@@H](CCC(=O)[O-])C(=O)NCC(=O)N[C@H](C(=O)N[C@@H](CC(N)=O)C(=O)N[C@H](C(=O)N[C@@H](CS[C@H]1CC(=O)N(c2ccc3c(c2)C(=O)OC32c3ccc(O)cc3Oc3cc(O)ccc32)C1=O)C(=O)N[C@@H](CC(=O)[O-])C(=O)N[C@@H](CCCC[NH3+])C(=O)N[C@@H](CC(=O)[O-])C(=O)N[C@@H](Cc1ccccc1)C(=O)N[C@@H](CO)C(=O)N[C@@H](C)C(=O)N[C@@H](CC(C)C)C(=O)N[C@H](C(=O)N[C@@H](CO)C(=O)N[C@@H](CCCC[NH3+])C(=O)N[C@@H](CCSC)C(=O)N[C@@H](CCCC[NH3+])C(=O)N[C@@H](CCC(=O)[O-])C(=O)N[C@@H](C)C(=O)NCC(=O)N[C@H](C(=O)N[C@@H](CO)C(=O)N[C@H](C(=O)N[C@H](C(=O)N[C@@H](Cc1ccc(O)cc1)C(=O)N[C@@H](Cc1c[nH]c2ccccc12)C(=O)NCC(=O)NCC(=O)N[C@@H](CC(C)C)C(=O)N[C@@H](Cc1c[nH]cn1)C(=O)N[C@H](C(=O)N[C@@H](CCC(=O)[O-])C(=O)N[C@@H](C)C(=O)NCC(=O)N[C@@H](CC(C)C)C(=O)N[C@H](C(=O)N[C@H](C(=O)N[C@@H](CCCNC(N)=[NH2+])C(=O)N[C@@H](CCC(N)=O)C(=O)N[C@@H](C)C(=O)N[C@@H](C)C(=O)N[C@@H](CC(=O)[O-])C(=O)N[C@@H](CCC(N)=O)C(=O)NCC(=O)N[C@@H](CC(C)C)C(=O)N[C@@H](CCCC[NH3+])C(=O)N[C@@H](C)C(=O)N[C@@H](CCCC[NH3+])C(=O)N[C@@H](CC(C)C)C(=O)N[C@H](C(=O)N[C@@H](CO)C(=O)NCC(=O)N[C@@H](CC(=O)[O-])C(=O)NCC(=O)N[C@H](C(=O)N[C@H](C(=O)N[C@@H](CO)C(=O)N[C@@H](CC(N)=O)C(=O)N[C@@H](CCC(=O)[O-])C(=O)N[C@@H](CC(C)C)C(=O)N[C@@H](C)C(=O)N[C@@H](CO)C(=O)N[C@H](C(=O)N[C@@H](C)C(=O)NCC(=O)N[C@@H](CC(=O)[O-])C(=O)N[C@@H](C)C(=O)N[C@H](C(=O)N[C@@H](CCC(=O)[O-])C(=O)NCC(=O)N[C@H](C(=O)N[C@@H](CC(C)C)C(=O)N[C@@H](CC(N)=O)C(=O)N[C@H](C(=O)N[C@@H](Cc1ccccc1)C(=O)NCC(=O)N1CCC[C@H]1C(=O)N[C@@H](CC(=O)[O-])C(=O)N1CCC[C@H]1C(=O)N[C@H](C(=O)N[C@@H](CC(C)C)C(=O)N[C@@H](CCCNC(N)=[NH2+])C(=O)N1CCC[C@H]1C(=O)N[C@@H](CCC(=O)[O-])C(=O)N[C@@H](CC(N)=O)C(=O)N[C@@H](CCCC[NH3+])C(=O)N[C@@H](CCC(=O)[O-])C(=O)N[C@@H](CC(C)C)C(=O)N[C@H](C(=O)N[C@@H](CCC(=O)[O-])C(=O)N[C@@H](CCCC[NH3+])C(=O)N[C@@H](Cc1ccccc1)C(=O)N[C@@H](CCCC[NH3+])C(=O)N[C@@H](C)C(=O)N[C@@H](C)C(=O)NCC(=O)N[C@@H](Cc1ccccc1)C(=O)N[C@@H](CC(N)=O)C(=O)N1CCC[C@H]1C(=O)N[C@@H](CCC(=O)[O-])C(=O)N[C@@H](C)C(=O)N[C@@H](Cc1ccc(O)cc1)C(=O)N[C@H](C(=O)N[C@@H](CC(C)C)C(=O)N[C@@H](Cc1ccc(O)cc1)C(=O)N[C@@H](CO)C(=O)N[C@@H](Cc1ccc(O)cc1)C(=O)N[C@@H](C)C(=O)N[C@@H](C)C(=O)N[C@@H](CCSC)C(=O)N[C@@H](CCC(N)=O)C(=O)N[C@@H](C)C(=O)N[C@H](C(=O)N[C@@H](C)C(=O)NCC(=O)N[C@@H](C)C(=O)N[C@@H](C)C(=O)N[C@@H](CCCC[NH3+])C(=O)N[C@@H](C)C(=O)N[C@@H](C)C(=O)NCC(=O)N[C@@H](CO)C(=O)N[C@H](C(=O)N[C@@H](CCC(=O)[O-])C(=O)N1CCC[C@H]1C(=O)N[C@@H](CCC(=O)[O-])C(=O)N[C@@H](CCCC[NH3+])C(=O)N[C@H](C(=O)N[C@@H](C)C(=O)N[C@@H](CCC(=O)[O-])C(=O)N[C@@H](C)C(=O)N[C@@H](CC(C)C)C(=O)N[C@@H](CCCC[NH3+])C(=O)N[C@@H](CCCC[NH3+])C(=O)NCC(=O)N[C@@H](CO)C(=O)N[C@@H](Cc1ccccc1)C(=O)N1CCC[C@H]1C(=O)N[C@H](C(=O)N[C@@H](C)C(=O)N[C@@H](CC(C)C)C(=O)NCC(=O)N[C@@H](CCC(=O)[O-])C(=O)N[C@H](C(=O)N[C@@H](CO)C(=O)N[C@@H](Cc1ccccc1)C(=O)N[C@@H](CC(=O)[O-])C(=O)N[C@@H](CCC(=O)[O-])C(=O)N[C@@H](CCCC[NH3+])C(=O)NCC(=O)N[C@@H](CC(=O)[O-])C(=O)N1CCC[C@H]1C(=O)N[C@@H](CCCC[NH3+])C(=O)N[C@@H](CC(C)C)C(=O)N1CCC[C@H]1C(=O)NCC(=O)N[C@@H](Cc1ccc(O)cc1)C(=O)N[C@H](C(=O)N[C@@H](CCSC)C(=O)N[C@@H](Cc1ccc(O)cc1)C(=O)N[C@@H](CCC(=O)[O-])C(=O)N[C@@H](Cc1c[nH]c2ccccc12)C(=O)N[C@@H](CCCC[NH3+])C(=O)N[C@@H](CCCC[NH3+])C(=O)NCC(=O)N1CCC[C@H]1C(=O)N[C@@H](CC(=O)[O-])C(=O)NCC(=O)N[C@@H](CCCC[NH3+])C(=O)N[C@@H](Cc1ccccc1)C(=O)N[C@H](C(=O)N[C@@H](Cc1ccc(O)cc1)C(=O)N[C@H](C(=O)N[C@@H](CCC(N)=O)C(=O)N[C@@H](CCC(N)=O)C(=O)NC)[C@@H](C)CC)[C@@H](C)O)C(C)C)[C@@H](C)CC)[C@@H](C)O)C(C)C)C(C)C)[C@@H](C)CC)[C@@H](C)O)C(C)C)[C@@H](C)O)[C@@H](C)O)[C@@H](C)O)C(C)C)[C@@H](C)CC)C(C)C)[C@@H](C)CC)C(C)C)[C@@H](C)CC)[C@@H](C)CC)[C@@H](C)O)[C@@H](C)CC)[C@@H](C)CC)C(C)C)[C@@H](C)CC)C(C)C)C(C)C)C(C)C)[C@@H](C)O)C(C)C)[C@@H](C)O)[C@@H](C)O)[C@@H](C)CC)[C@@H](C)CC)C(C)C"
)
oemol = protein.to_openeye()
offmol = Molecule.from_openeye(oemol)
Requires memray installed:
>>> python -m memray run mre.py
The screenshot points to this line:
https://github.com/openforcefield/openff-toolkit/blob/97af5935b040556903d1056278a7eff3efe24c01/openff/toolkit/utils/openeye_wrapper.py#L1329
Output
Computing environment (please complete the following information):
- Operating system
- Output of running
conda list
Name Version Build Channel
─────────────────────────────────────────────────────────────────
openff-amber-ff-ports 0.0.4 pyhca7485f_0 conda-forge
openff-forcefields 2024.03.0 pyhca7485f_0 conda-forge
openff-interchange-base 0.3.25 pyhd8ed1ab_0 conda-forge
openff-models 0.1.2 pyhca7485f_0 conda-forge
openff-nagl 0.3.6 pyhd8ed1ab_0 conda-forge
openff-nagl-base 0.3.6 pyhd8ed1ab_0 conda-forge
openff-nagl-models 0.1.2 pyhd8ed1ab_0 conda-forge
openff-recharge 0.5.2 pyhd8ed1ab_0 conda-forge
openff-toolkit-base 0.15.2 pyhd8ed1ab_0 conda-forge
openff-units 0.2.2 pyhca7485f_0 conda-forge
openff-utilities 0.1.12 pyhd8ed1ab_0 conda-forge
Additional context
Manifest:
- mre.py (includes the protein smirks)
- memray-mre.py.10332.bin: the output of memray
- memray-flamegraph-mre.py.10332.html: the interactive graph in the screenshot
I tried implementing this since it should be easy, but it's not. Simply adding a NumConfs() call doesn't do the trick. I don't know how to check for OpenEye's annoying "courtesy conformer" without calling out to GetConfs(), which I understand to be the problem:
In [32]: oemol = Molecule.from_smiles("CCO").to_openeye()
In [33]: oemol.NumConfs()
Out[33]: 1
In [34]: [*oemol.GetConfs()][0].GetCoords()
Out[34]:
{0: (0.0, 0.0, 0.0),
1: (0.0, 0.0, 0.0),
2: (0.0, 0.0, 0.0),
3: (0.0, 0.0, 0.0),
4: (0.0, 0.0, 0.0),
5: (0.0, 0.0, 0.0),
6: (0.0, 0.0, 0.0),
7: (0.0, 0.0, 0.0),
8: (0.0, 0.0, 0.0)}
In [35]: molecule = Molecule.from_smiles("O=S(=O)(N)c1c(Cl)cc2c(c1)S(=O)(=O)NCN2")
In [36]: molecule.generate_conformers(n_conformers=1)
In [37]: oemol = molecule.to_openeye()
In [38]: oemol.NumConfs()
Out[38]: 1
In [39]: [*oemol.GetConfs()][0].GetCoords()
Out[39]:
{0: (1.8719326257705688, 3.7204949855804443, 2.2212681770324707),
1: (1.2912099361419678, 4.097604274749756, 0.9475870132446289),
2: (0.3753527104854584, 5.218091011047363, 0.8554574251174927),
3: (2.534075975418091, 4.290732383728027, -0.20339979231357574),
4: (0.4765625, 2.689453125, 0.296875),
5: (-0.5654296875, 2.794921875, -0.62060546875),
6: (-1.133737325668335, 4.327646255493164, -1.178165078163147),
7: (-1.181640625, 1.642578125, -1.119140625),
8: (-0.7685546875, 0.360595703125, -0.7216796875),
9: (0.280029296875, 0.290283203125, 0.2086181640625),
10: (0.90185546875, 1.43359375, 0.71923828125),
11: (0.84521484375, -1.2734375, 0.802734375),
12: (1.9853515625, -1.6318359375, -0.0174713134765625),
13: (0.93994140625, -1.193359375, 2.24609375),
14: (-0.484130859375, -2.26953125, 0.403076171875),
15: (-0.9970703125, -2.099609375, -0.96826171875),
16: (-1.4326171875, -0.7451171875, -1.2314453125),
17: (2.4708824157714844, 5.089303016662598, -0.8458374738693237),
18: (3.4932398796081543, 4.066527366638184, 0.08694052696228027),
19: (-2.0012941360473633, 1.7379204034805298, -1.8306996822357178),
20: (1.7089277505874634, 1.3463486433029175, 1.4417718648910522),
21: (-1.2067643404006958, -2.400458812713623, 1.1223875284194946),
22: (-0.20102126896381378, -2.3644607067108154, -1.6715291738510132),
23: (-1.8237838745117188, -2.7984254360198975, -1.1371092796325684),
24: (-2.3384859561920166, -0.6081215143203735, -1.6641637086868286)}
Okay, actually thinking about this a little more clearly, using GetConfs (which returns an iterator of all conformers) might be the issue if it's not a generator. I can't tell from the docs and SWIG magic if it's lazy like a generator or EEAAO more like a list.
There's also GetConfIter which only exists when there are two or more conformers. This could provide a useful branching point if it didn't fail to distinguish whether a single conformer was real or not when there's only one.
Hm, the courtesy conformer is annoying. This is low priority at best since it's a very moderate amount of memory even for a decent sized protein. Thanks for looking into it!