openff-toolkit
openff-toolkit copied to clipboard
Molecule.from_x methods should treat disconnected structures as separate molecules
SMILES strings can store "disconnected molecules" like a ligand with a counter-ion. In OpenFF world, these should be treated as separate molecules - they should be parameterized separately, so they should be stored separately.
Probably the simplest example is NaCl:
>>> from openff.toolkit.topology.molecule import Molecule
>>> nacl = Molecule.from_smiles("[Na+].[Cl-]")
>>> assert type(nacl) is list, "nope, looks like it's read in as a single molecule"
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AssertionError: nope, looks like it's read in as a single molecule
This should probably return a list of Molecule
s, one for each ion. Presumably there's a way for this case to be processed by the wrapped toolkits.
There are probably other points in the infrastructure this behavior should be considered; Molecule.from_smilies
is just where I happened to come across it.
@j-wags suggested the following solution on Slack, which @davidlmobley agreed with:
In the short term, we should have that
from_smiles
call raise an error. In the long term, we can discuss whether to have that return two molecules (though that would change the return type of from_smiles to sometimes be multiple molecules, which is a bad practice)
Structure file records may have multiple components as well. Consider from_file("salts.smi") containing:
[Na+].[Cl-] table
[Cl-].[NH4+] liquorice
or the corresponding SDF.
I agree it would be error-prone to have a function sometimes return a molecule and other times return a list of molecules. Perhaps from_multicomponent_smiles()
and from_multicomponent_file()
?
Perhaps from_multicomponent_smiles() and from_multicomponent_file()?
That's an interesting idea. I'd been thinking that it'd be safest/most generic to make the return type of from_file
(and possibly now from_smiles
) always be an iterable of Molecule
s, even if there's just one Molecule
in it most of the time. Though the ...multicomponent...
calls could actually offer us a reverse-compatible way forward, where we keep the old from_file
, but then the from_multicomponent_file
would be the thing that always returns a list!
This sort of basic validation should also check for atoms with atomic number 0 ("*
" atoms in SMILES, R-groups in molfiles). Plus probably also unsupported bond types.