openff-toolkit icon indicating copy to clipboard operation
openff-toolkit copied to clipboard

Molecule.from_x methods should treat disconnected structures as separate molecules

Open mattwthompson opened this issue 3 years ago • 4 comments

SMILES strings can store "disconnected molecules" like a ligand with a counter-ion. In OpenFF world, these should be treated as separate molecules - they should be parameterized separately, so they should be stored separately.

Probably the simplest example is NaCl:

>>> from openff.toolkit.topology.molecule import Molecule
>>> nacl = Molecule.from_smiles("[Na+].[Cl-]")
>>> assert type(nacl) is list, "nope, looks like it's read in as a single molecule"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AssertionError: nope, looks like it's read in as a single molecule

This should probably return a list of Molecules, one for each ion. Presumably there's a way for this case to be processed by the wrapped toolkits.

There are probably other points in the infrastructure this behavior should be considered; Molecule.from_smilies is just where I happened to come across it.

mattwthompson avatar Jul 06 '21 16:07 mattwthompson

@j-wags suggested the following solution on Slack, which @davidlmobley agreed with:

In the short term, we should have that from_smiles call raise an error. In the long term, we can discuss whether to have that return two molecules (though that would change the return type of from_smiles to sometimes be multiple molecules, which is a bad practice)

mattwthompson avatar Jul 06 '21 18:07 mattwthompson

Structure file records may have multiple components as well. Consider from_file("salts.smi") containing:

[Na+].[Cl-] table
[Cl-].[NH4+] liquorice

or the corresponding SDF.

I agree it would be error-prone to have a function sometimes return a molecule and other times return a list of molecules. Perhaps from_multicomponent_smiles() and from_multicomponent_file()?

adalke avatar Jul 06 '21 19:07 adalke

Perhaps from_multicomponent_smiles() and from_multicomponent_file()?

That's an interesting idea. I'd been thinking that it'd be safest/most generic to make the return type of from_file (and possibly now from_smiles) always be an iterable of Molecules, even if there's just one Molecule in it most of the time. Though the ...multicomponent... calls could actually offer us a reverse-compatible way forward, where we keep the old from_file, but then the from_multicomponent_file would be the thing that always returns a list!

j-wags avatar Jul 06 '21 19:07 j-wags

This sort of basic validation should also check for atoms with atomic number 0 ("*" atoms in SMILES, R-groups in molfiles). Plus probably also unsupported bond types.

adalke avatar Jul 12 '21 15:07 adalke