openff-toolkit
openff-toolkit copied to clipboard
Should we add a Molecule.from_xyz() to allow easy perception of chemistry from dataset that lack chemical information?
Is your feature request related to a problem? Please describe.
I repeatedly hear from users that have erased the chemical information in their molecule (transforming into something like an XYZ or PDB file that only contains elements and 3D coordinates) that they would like a convenient way to create a Molecule
to use OpenFF toolkit features, even if the heuristics for reconstructing chemical information are error-prone. It seems like we should support some path for doing this conveniently, but perhaps require an optional argument in the constructor or a special API point that requires the user to implicitly acknowledge this is a less desirable path to constructing a molecule.
Describe the solution you'd like
We could support either a constructor or static method that enables file formats that lack chemical information to create a Molecule
object, provided a special API point or optional argument is provided that acknowledges this approach is likely error-prone.
molecule = Molecule('xyzfile.xyz', use_unsafe_chemical_perception=True)
molecule = Molecule.from_xyz_unsafe(xyzfile)
- The OpenEye toolkit provides a heuristic to perceive chemistry from positions.
- RDKit provides a way to construct a molecule from a PDB file block
- AmberTools provides a way to perceive connectivity and formal charges from a PDB/XYZ file and write an SDF file using a separate set of heuristics (with drawbacks described here).
Describe alternatives you've considered This problem comes up frequently, so having an explicit API for dealing with it, together with explicit documentation on why this is not the preferred route, is a better practice than the current state of things.
Hm, this a big source of user confusion, so something like this for XYZ (and maybe some PDBs that have element info) could be nice.
On the other hand, this adds to our scope, and could lead us toward being responsible for differences between guessing heuristics in different packages.
If we do implement this, I recommend the second API point you suggested.
molecule = Molecule.from_xyz_unsafe(xyzfile)
I'd be happy to lay out more of a spec for this if someone wants to take it on. For now, since workarounds are available (and the workarounds indicate to the user that we're not responsible for the correctness of the guesses), I'll tag this as medium priority.
Yes, at some level this is a cheminformatics problem and my/our preference for this has been that we depend on the cheminformatics engines for this since there are deep issues here we don't want to own.
I do like the idea of having "unsafe" in the function name if we ever do this. :) Or we could go more extreme and say "from_xyz_shootyourselfinthefoot".
molecule = Molecule.from_xyz_unsafe(xyzfile)
If you made the entrypoint from_unsafe_xyz_data
or something that does not depend on it being a file, you could route from_qcschema
through it too.
FWIW in #1148 (should be live in v0.10.2) I added a stopgap that at least ensures users are provided a little more direction if attempting to read XYZ files with the current toolkit:
$ python
>>> from openff.toolkit.topology.molecule import Molecule
>>> Molecule.from_file("foo.xyz")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/mwt/software/openforcefield/openff/toolkit/topology/molecule.py", line 4748, in from_file
raise UnsupportedFileTypeError(
openff.toolkit.utils.exceptions.UnsupportedFileTypeError: Parsing `.xyz` files is not currently supported because
they lack sufficient chemical information to be used with SMIRNOFF force fields. For more information, see
https://open-forcefield-toolkit.readthedocs.io/en/latest/faq.html or to provide feedback please visit
https://github.com/openforcefield/openff-toolkit/issues/1145.
The linked issue is this issue, so users may drop by to offer feedback / express desire that something like this be added here.
This would be updated if Molecule.from_xyz_if_you_accept_the_risks
is implemented.