Molecule.enumerate_tautomers() can return empty list
Open
jchodera
opened this issue 3 years ago
•
14 comments
trafficstars
Describe the bugMolecule.enumerate_tautomers() can return an empty list---rather than the single tautomer or original molecule---if tautomer enumeration fails.
Here's an example of how one might normally expect to use these methods:
for protomer in offmol.enumerate_protomers():
for tautomer in protomer.enumerate_tautomers():
for stereoisomer in tautomer.enumerate_stereoisomers():
# process stereoisomer
If there is only one microstate for any of these enumeration methods (e.g. the molecule itself has only one protonation state, one tautomer, and/or one stereoisomer), it should clearly return an iterable containing only itself.
This seems complicated because it will change the outputs of qcsubmit, which is an important artifact generator for our science team and is essential to the reproducibility of our work. Will we need a plan to coordinate changes with qcsubmit and qca-dataset-submission?
It will also require similar changes to enumerate_stereoisomers and enumerate_protomers.
enumerate_stereoisomers will be complicated - should it include the input if it had unassigned stereocenters? And is it OK if OE and RDKit disagree over what constitutes an unidentified stereocenter (#146)? Also what if the user requests rationalise=True but the input isn't rationalizable?
This seems like it would be a behavior change and expose a bunch of edge cases for little value.
Would an argument like return_self help here, where the default is the current default behaviour? iirc there were inconsistencies with the behaviour of enumerate_stereoisomers and enumerate_protomers, as detailed in https://github.com/openforcefield/openff-toolkit/issues/738.
Thanks for pulling up that link, totally missed that there was a previous issue on this.
Adding return_self still doubles the number of possible input permutations that we need to handle, so that doesn't seem like an improvement from the "cost to make and maintain" side. And if we made it default to True then we'd still have the same "behavior change" problems as above, or if it was False users would continue complaining that the default behavior is surprising.
I'm going to tag this as medium priority, because Matt and I (and apparently Trevor and I) have put several hours into this and it's just getting more complex. Meanwhile there's lower-hanging, higher-value fruit elsewhere that needs fixing.
To turn this around, since it sounds like there isn't much appetite to make the API behave in an obvious and consistent manner or to document the current non-obvious implemented behavior:
If I want to enumerate all the microstates of a Molecule (protonation states, tautomers, and undefined stereochemistry [if any stereocenters/bonds are undefined]) using the current API, how would I do this?
for molecule in molecules:
try:
tautomers = molecule.enumerate_tautomers(
max_states=self.max_tautomers, toolkit_registry=toolkit_registry
)
for tautomer in tautomers:
result.add_molecule(tautomer)
# add the input molecule
result.add_molecule(molecule=molecule)
except Exception:
result.filter_molecule(molecule)
which could be replaced with
for tautomer in molecule.enumerate_tautomers(
max_states=self.max_tautomers, toolkit_registry=toolkit_registry):
result.add_molecule(molecule=tautomer)
with the proposed change in API to have enumerate_tautomers enumerate the list of reasonable tautomers.
Note also that this except Exception: is probably a very bad idea, since it could result in even minor unexpected issues (e.g. missing dependencies) silently filtering out molecules rather than passing on critical errors.
Molecule.enumerate_stereoisomers is called here and is much more confusing:
for molecule in molecules:
try:
isomers = molecule.enumerate_stereoisomers(
undefined_only=self.undefined_only,
max_isomers=self.max_isomers,
rationalise=self.rationalise,
toolkit_registry=toolkit_registry,
)
# now check that each molecule is well defined
for isomer in isomers:
if not check_missing_stereo(isomer):
result.add_molecule(isomer)
# now check the input
# rationalise if needed
if self.rationalise:
molecule.generate_conformers(n_conformers=1)
if not check_missing_stereo(molecule):
result.add_molecule(molecule)
except Exception:
result.filter_molecule(molecule)
If this behavior is needed to enumerate all undefined stereocenters/bonds, it would be sensible to incorporate it into Molecule.enumerate_stereoisomers since it will be difficult for others to follow that this is the idiom one would need to enumerate undefined stereocenters/bonds.
Finally, Molecule.enumerate_protomers is called here:
for molecule in molecules:
try:
protomers = molecule.enumerate_protomers(max_states=self.max_states)
for protomer in protomers:
result.add_molecule(protomer)
result.add_molecule(molecule)
except Exception:
# if openeye is not available this will fail
result.filter_molecule(molecule)
which is straightforward, but still is a mismatch for the API call---which seems to be "enumerate alternative protonation states that are not this one".
Since there is also no straightforward method to correctly enumerate all microstates of a molecule (e.g. Molecule.enumerate_microstates), this leads to potential issues if these enumeration schemes are also called in incorrect order. It appears that the default workflows in QCSubmit run afoul of this issue: In the openff.qcsubmit.state_enumeration workflow, these are called in this order:
Because the enumerate_tautomers step can change the number of protons on a nitrogen, this could create or delete stereocenters at nitrogens (e.g. on a piperazine ring) which might fail to be enumerated. My understanding is that the correct order should be
But it's easy to see why this could potentially cause confusion.
My recommendations would be:
Molecule.enumerate_protomers() should enumerate all reasonable protonation states of the molecule (including the input state)
Molecule.enumerate_tautomers() should enumerate all reasonable tautomeric states of the molecule (including the input state)
Molecule.enumerate_stereoisomers() should enumerate all uncertain/undefined stereocenters/bonds of the molecule (including the input state)
Molecule.enumerate_microstates() should call all of these in the correct order (protomers, tautomers, stereoisomers) to avoid confusion
Exceptions for each of these should be clearly documented in the docstrings so they can be properly handled.
QCSubmit should avoid except Exception which might drop microstates or molecules silently
QCSubmit should correct the enumeration order in its default workflow and/or offer the simpler EnumerateMicrostates that will do this correctly via Molecule.enumerate_microstates()
I totally understand there are higher-priority issues to address, of course, but it would be great to tidy this up while folks are still updating to the new API.
From what I remember and what @jchodera has posted above, QCSubmit also has to work around the issue of the input molecule not being returned by the method, so this change could slightly simplify it. What's more, if this would make the behaviour consistent with the backend toolkits this could help make usage easier.
On the other hand, this current behaviour is documented in the returns sections of each function maybe this should also be reflected in the top description?
It appears that the default workflows in QCSubmit run afoul of this issue
That's not quite right QCSubmit makes no assumptions about what the user might want to run and they are free to construct each workflow in any order they see fit so they could end up in this situation and I agree that a workflow component such a EnumerateMicrostates would be great! The ordering you refer to though is just alphabetical and provided by isort.
Sure - Happy to accept a fix! Recommendations 1-3, 4, and 5 would probably be best as separate PRs to the OFF Toolkit, and 6+7 would be to QCSubmit. They seem separable so the PRs can be made in any order and won't block each other.