openff-toolkit icon indicating copy to clipboard operation
openff-toolkit copied to clipboard

Remove dummy atoms

Open lilyminium opened this issue 4 years ago • 7 comments
trafficstars

Describe the bug A clear and concise description of what the bug is.

~Checking molecule equality for molecules with dummy atoms raises an error because the equality check first checks the hill_formula, which requires elements. A dummy atom (atomic number 0) does not have an element and therefore raises a KeyError.~

The toolkit looks like it supports dummy atoms, but it doesn't. It should validate the input lest this come as a surprise to users. Please ignore the remainder of the issue, which was made under the assumption that dummy atoms are supported.

To Reproduce Steps to reproduce the behavior. A minimal reproducing set of python commands is ideal.

If the problem involves a specific molecule or file, please upload that as well.

from openff.toolkit.topology import Molecule
mol = Molecule.from_smiles("[*]H")
mol == mol

Output The full error message (may be large, that's fine. Better to paste too much than too little.)

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-107-3735b8ff570b> in <module>
      1 mol = Molecule.from_smiles("[*]H")
----> 2 mol == mol

openff/toolkit/topology/molecule.py in __eq__(self, other)
   2226         # updated to use the new isomorphic checking method, with full matching
   2227         # TODO the doc string did not match the previous function what matching should this method do?
-> 2228         return Molecule.are_isomorphic(self, other, return_atom_map=False)[0]
   2229 
   2230     def to_smiles(

openff/toolkit/topology/molecule.py in are_isomorphic(mol1, mol2, return_atom_map, aromatic_matching, formal_charge_matching, bond_order_matching, atom_stereochemistry_matching, bond_stereochemistry_matching, strip_pyrimidal_n_atom_stereo, toolkit_registry)
   2576 
   2577         # Do a quick hill formula check first
-> 2578         if Molecule.to_hill_formula(mol1) != Molecule.to_hill_formula(mol2):
   2579             return False, None
   2580 

openff/toolkit/topology/molecule.py in to_hill_formula(molecule)
   3890         from collections import Counter
   3891 
-> 3892         atom_symbol_counts = Counter(
   3893             Element.getByAtomicNumber(atom_num).symbol for atom_num in atom_nums
   3894         )

~/anaconda3/envs/polymetrizer-viz/lib/python3.8/collections/__init__.py in __init__(self, iterable, **kwds)
    550         '''
    551         super(Counter, self).__init__()
--> 552         self.update(iterable, **kwds)
    553 
    554     def __missing__(self, key):

~/anaconda3/envs/polymetrizer-viz/lib/python3.8/collections/__init__.py in update(self, iterable, **kwds)
    635                     super(Counter, self).update(iterable) # fast path when counter is empty
    636             else:
--> 637                 _count_elements(self, iterable)
    638         if kwds:
    639             self.update(kwds)

openff/toolkit/topology/molecule.py in <genexpr>(.0)
   3891 
   3892         atom_symbol_counts = Counter(
-> 3893             Element.getByAtomicNumber(atom_num).symbol for atom_num in atom_nums
   3894         )
   3895 

simtk/openmm/app/element.py in getByAtomicNumber(atomic_number)
    105     @staticmethod
    106     def getByAtomicNumber(atomic_number):
--> 107         return Element._elements_by_atomic_number[atomic_number]
    108 
    109     @staticmethod

KeyError: 0

Computing environment (please complete the following information):

  • Operating system
  • Output of running conda list

Additional context Add any other context about the problem here.

Working molecule:

mol = Molecule.from_smiles("[C]H")
mol == mol

In general the KeyError leads to some unexpected behaviour. Could the toolkit return an element of None for all invalid atomic numbers?

lilyminium avatar Jun 09 '21 15:06 lilyminium

Hm, in general I don't think we want to support dummy atoms -- Anything that becomes an OFFMol should be a "real molecule" that can be simulated. Though maybe I'm missing a use case here?

CC @jeff231li, who had asked about dummy atoms before (though I don't think he wanted them in a molecule's graph, just as a placeholder in the topology)

j-wags avatar Jun 09 '21 17:06 j-wags

In that case I guess you'd want to validate inputs from SMILES, RDKit etc. because at the moment it does look like dummy atoms are supported -- I predict that people like me will continue to raise similar issues until this is made clear. My use case is that I was writing code inspired by @SimonBoothroyd's constructure to build molecules and using the OpenFF Molecule to hold information as a way to straddle OpenEye and RDKit :)

lilyminium avatar Jun 09 '21 18:06 lilyminium

To put it another way, as a user I would expect any Molecule I successfully created to be able to use the methods associated with Molecule, particularly something as fundamental as __eq__. I don't mind if I can't successfully create a Molecule with dummy atoms, but since I can, I am surprised at the behaviour reported here.

Although I hope I can continue to make Molecules with dummy atoms so I don't have to re-write my code 😅

lilyminium avatar Jun 09 '21 19:06 lilyminium

Still getting used to using GitHub. Left a comment in the PR at https://github.com/openforcefield/openff-toolkit/pull/978 rather than realizing the discussion was here.

In short, adding dummy atom support (wildcard "*" atom in SMILES, "R" atom in SDF) requires changing simtk to have an Element with atomic number 0, or requires patching around its lack of support.

adalke avatar Jun 14 '21 20:06 adalke

My use case is that I was writing code inspired by @SimonBoothroyd's constructure to build molecules and using the OpenFF Molecule to hold information as a way to straddle OpenEye and RDKit :)

I took a look at that package. I think there are a couple of approaches you could take.

One is to switch to using some other atom as the intermediate, like "Np" instead of "*". It's unlikely you'll find neptunium in your data set!

Another might be to do the enumeration completely in SMILES space, rather than going through intermediate wildcards and a toolkit-based reaction. Here's a decidedly incomplete sketch, which depends on the existing layout with ([R1]), ([R2]) ... in the scaffold SMILES and [R] as the first three characters of the R-group SMILES extension:

import re
from dataclasses import dataclass
from typing import Dict, List
import itertools

@dataclass
class Scaffold:
    smiles: str
    r_groups: Dict[int, List[str]]

SCAFFOLDS = {
    # cation
    # anion
    # carbonyl compound
    "aldehyde": Scaffold(
        smiles="C([R1])(=O)", r_groups={1: ["hydrogen", "alkyl", "aryl"]}
    ),
    "ketone": Scaffold(
        smiles="C([R1])(=O)([R2])",
        r_groups={1: ["alkyl", "aryl"], 2: ["alkyl", "aryl"]},
    ),
}

_functional_groups = {
    "hydrogen": {
        "hydrogen": "[R][H]",
        },
        
    # Alkanes
    "alkyl": {
        "methyl": "[R]C",
        "ethyl": "[R]CC",
        "isopropyl": "[R]C(C)(C)",
        "tert-butyl": "[R]C(C)(C)(C)",
        },
    # Aryl
    "aryl": {
        "phenyl": "[R]c1ccccc1",
        "pyridyl": "[R]c1ccncc1",
        "benzyl": "[R]Cc1ccccc1",
        },
    }
FUNCTIONAL_GROUPS = {group_name: list(r_groups.values()) for (group_name, r_groups) in _functional_groups.items()}

SUBSTITUENTS = {}
for group_name, group_substituents in _functional_groups.items():
    SUBSTITUENTS.update(group_substituents)


_Rn_atom_pat = re.compile(r"\(\[R([1-9])+]\)")

def enumerate_combinations(scaffold, substituents):
    smiles = scaffold.smiles
    parts = []

    # Remove the leading '[R]' and surround by "()"s for substitution in ([R1]), etc. matches
    trimmed_substituents = {}
    for r_group_idx, r_groups in substituents.items():
        trimmed_r_groups = []
        for r_group in r_groups:
            if not r_group.startswith("[R]"):
                raise ValueError(f"R-groups must start with '[R]': {r_group}")
            trimmed_r_groups.append("(" + r_group[3:] + ")")
        trimmed_substituents[r_group_idx] = trimmed_r_groups
        
    # Find the ([R1]) etc. matches and the intermediate strings
    pos = 0
    for match in _Rn_atom_pat.finditer(smiles):
        start = match.start()
        end = match.end()
        if start > pos:
            parts.append([smiles[pos:start]]) # single element substring

        r_group = int(match.group(1))
        parts.append(trimmed_substituents[r_group])

        pos = end
    if pos < len(scaffold.smiles): # Final part of the SMILES
        parts.append( [scaffold.smiles[pos:]] )

    for terms in itertools.product(*parts): # Enumerate the component parts
        yield "".join(terms)

print(list(enumerate_combinations(
    SCAFFOLDS["ketone"], {
        1: ["[R]C", "[R]c1ccccc1"],
        2: FUNCTIONAL_GROUPS["alkyl"],
        })))

This generates:

['C(C)(=O)(C)', 'C(C)(=O)(CC)', 'C(C)(=O)(C(C)(C))', 'C(C)(=O)(C(C)(C)(C))', 'C(c1ccccc1)(=O)(C)', 'C(c1ccccc1)(=O)(CC)', 'C(c1ccccc1)(=O)(C(C)(C))', 'C(c1ccccc1)(=O)(C(C)(C)(C))']

Note that I don't touch the classification or validation in the existing code! But if you only need this chemistry-toolkit-agnostic enumeration, then perhaps something like the above, once debugged, might work in an OpenFF environment?

adalke avatar Jun 14 '21 21:06 adalke

Thanks for taking a look @adalke! The code I'm referring to is actually the much less mature polymetrizer package that doesn't use substituents in the way Simon does, where each substituent has one R-group; rather, the input is multiple molecules with potentially multiple R-groups, which are each treated as Monomers or "residues" in a putative OpenFF residue-forcefield framework. polymetrizer enumerates combinations of them (Oligomers) and aggregates parametrized systems to build a residue-based force field. Right now I'm using it to check if AM1BCC charges assembled this way, from parts of a larger molecule, fall within a tolerable range of charges calculated for the entire larger molecule.

You're probably right that the whole thing could be done in SMARTS/SMILES space, though. I'm kicking myself for not thinking of that earlier... and it would be faster.

lilyminium avatar Jun 16 '21 04:06 lilyminium

No need to kick yourself. I know how long it took me to realize how much cheminformatics could be done as syntax manipulation in SMILES space rather than molecular graph manipulation in toolkit space, and I continue to come up with new ways to work directly on SMILES strings.

One that I developed a few years ago was to re-write attachment points, noted using a wildcard like *C(P)CCO into bond closure notation like C%90(P)CCO so components could be joined like: C%90(P)CCO.CC%90. I call this method "welding", and used it in my matched molecular pair program mmpdb to join multiple fragments.

(Note that it's a bit trickier to handle stereochemisty correctly as moving the "*" form from the left to a closure on the right require swapping "@" and "@@" chirality, and that technique only works for tetrahedral stereochemistry .... which is all that RDKit supports. :) )

Many years ago there was a paper on using the the same idea - bond-closures + dot-disconnect - to turn combinatorial enumeration into a SMILES string combinatorial enumeration followed by canonicalization. See http://www.dalkescientific.com/writings/diary/archive/2004/12/12/library_generation_with_smiles.html where I describe the approach. (The paper was published earlier but I didn't know about it.)

adalke avatar Jun 16 '21 05:06 adalke