
SMILES property

Open merkys opened this issue 4 years ago • 8 comments

In #368 SMILES property for structures was proposed. After some discussion the following consensus emerged:

  • OpenSMILES specification for SMILES is to be used
  • Type of this property is String. Thus all queries on this property have to treat it as String, without analyzing underlying chemical structure.
  • For inorganic structures/parts, recommendations by Quirós et al. 2018 are suggested. (Disclosure: I am a co-author for this paper).

Fixes #368.
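To illustrate what "treat it as String" implies, here is a minimal sketch of a raw string comparison. The entry IDs, the in-memory dictionary, and the function name are hypothetical and only stand in for a real database back-end; the two SMILES strings both describe 2-methylphenol.

```python
# Toy in-memory "database": entry IDs and stored strings are illustrative only.
entries = {
    "entry-1": "CC1=CC=CC=C1O",  # 2-methylphenol, Kekulé form
    "entry-2": "CCCO",           # propan-1-ol
    "entry-3": None,             # the property MAY be null
}


def filter_smiles_equals(entries, query):
    """Return IDs whose stored SMILES equals `query` character for character."""
    return sorted(eid for eid, smi in entries.items() if smi == query)


# Same molecule, two spellings: only the stored spelling matches.
print(filter_smiles_equals(entries, "CC1=CC=CC=C1O"))  # ['entry-1']
print(filter_smiles_equals(entries, "c1(C)ccccc1O"))   # []
```

Under this reading, a client only gets a hit if it sends exactly the string the provider stored.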

merkys avatar Dec 03 '21 16:12 merkys

I suppose this was hashed through long ago (apologies), but honestly, this makes no sense, and I think you would find users quite dissatisfied.

Q: What's the use case here?

The whole idea of SMILES is that it doesn't matter how the user chooses to format the SMILES. If this is implemented at a service, they should be expected to treat it as all standard SMILES-accepting services do (PubChem, NCI Chemical Identity Resolver, maybe COD?) -- with full SMILES semantics and the ability to locally canonicalize the request (typically) so that on their end they can do a regular string search. But that is an internal choice of the service. For example, they might transform the SMILES to a molecular graph and do the search internally that way (starting with molecular formula, for instance). If it were a regular string search I would have to have gone to this service previously, cached their SMILES variant string and then used it for a search later. Why would I ever do that?

"substructure searching" in the SMILES business means something different. Substructure searching is done using SMARTS, not SMILES. Perhaps someday SMARTS substructure searching could be implemented in OPTIMADE, but that is a separate issue.

If you want to refer to substructure searching, perhaps: "SMARTS substructure searching..."

Bob (in Mumbai, GMT+5:30)

On Tue, Jul 5, 2022 at 2:43 PM Matthew Evans @.***> wrote:

@.**** commented on this pull request.

In optimade.rst https://github.com/Materials-Consortia/OPTIMADE/pull/392#discussion_r913565736 :

@@ -2439,6 +2439,22 @@ chemical_formula_anonymous

  • A filter that matches an exactly given formula is :filter:chemical_formula_anonymous="A2B".

+smiles

+~~~~~~

+- Description: The SMILES (Simplified Molecular Input Line Entry System) representation of the structure.

+- Type: string

+- Requirements/Conventions:

  • Support: OPTIONAL support in implementations, i.e., MAY be :val:null.
  • Query: Support for queries on this property is OPTIONAL.
    • Queries MUST treat the value of this property as a raw string, without SMILES-specific semantics.
    • That is, providers MUST NOT perform substructure search, just regular string comparison.
  • Value MUST adhere to the OpenSMILES specification v1.0 <http://opensmiles.org/opensmiles.html>__.
  • When structures or their parts cannot be unambiguously represented in SMILES according to OpenSMILES recommendations, using the guidelines from Quirós et al. 2018 <https://doi.org/10.1186/s13321-018-0279-6>__ is RECOMMENDED.
  • Providers MAY canonicalize (i.e., use rules to establish stable order of atoms) produced SMILES representations, but this is not mandatory.
    • Generally, providers SHOULD NOT change the representation more frequently than the structure itself is modified.

Can we provide a couple of examples for people who are unfamiliar with SMILES (without needing them to click out of the spec and read the paper/OpenSMILES spec)? The example below is taken from Wikipedia, so please check whether the string is actually OpenSMILES compliant. Suggested change:

    • Generally, providers SHOULD NOT change the representation more frequently than the structure itself is modified.
  • Examples:
    • caffeine: `CN1C=NC2=C1C(=O)N(C(=O)N2C)C`

In optimade.rst https://github.com/Materials-Consortia/OPTIMADE/pull/392#discussion_r913571896 :

@@ -2439,6 +2439,22 @@ chemical_formula_anonymous

  • A filter that matches an exactly given formula is :filter:chemical_formula_anonymous="A2B".

+smiles

+~~~~~~

+- Description: The SMILES (Simplified Molecular Input Line Entry System) representation of the structure.

I think we need a bit more clarification of the expected use.

How "much" of the structure should be described by the SMILES string for it to be valid here (e.g., that it should appear in the results when someone searches for it?) Do we need to require that every "site" in the OPTIMADE structure is present in the SMILES string? Obviously for nperiodic_dimensions=0 and a single molecule this makes sense, same for an nperiodic_dimensions=3 molecular crystal, but what about:

  • co-crystal with two distinct molecules (does SMILES do something fancy for this already?)
  • an inorganic surface with adsorbed molecule
  • a hybrid perovskite structure with molecular unit as a cation


-- Robert M. Hanson Professor of Chemistry St. Olaf College Northfield, MN http://www.stolaf.edu/people/hansonr


BobHanson avatar Jul 24 '22 01:07 BobHanson

@BobHanson

Q: What's the use case here?

The use case is just to allow databases to include a SMILES representation of a structure in whatever SMILES format the database likes (edit: whatever format is compatible with the OpenSMILES specification v1.0). The field isn't really meant to allow any useful form of "search" - the discussion in #368 seemed to conclude that there is too little standardization of SMILES to support such search in a consistent standardized way, except possibly via SMARTS. Hence, there is a separate PR for adding SMARTS in #398.

I think any thoughts on how a standardized SMILES-based search could work are very welcome in the discussion in #368 ( https://github.com/Materials-Consortia/OPTIMADE/issues/368 ). If I understand you correctly, you want the user to be able to give a SMILES string in "any format" and have the server internally handle the interpretation/conversion of that SMILES string to return entries whose SMILES is equivalent to the one given? Is there a benefit of doing the search this way, instead of formulating it as a SMARTS search where the full structure is the substructure?

(From the technical side I don't think the right way to express this kind of search is an expression that looks exactly like a string equality comparison. However, that is a technical discussion I think can be sorted out once we know more precisely how we would want a useful standardized SMILES search to work.)

rartino avatar Aug 08 '22 11:08 rartino

On Mon, Aug 8, 2022 at 6:27 AM Rickard Armiento @.***> wrote:

@BobHanson https://github.com/BobHanson

Q: What's the use case here?

The use case is just to allow databases to include a SMILES representation of a structure in whatever SMILES format the database likes. The field isn't really meant to allow any useful form of "search" - the discussion in #368 https://github.com/Materials-Consortia/OPTIMADE/issues/368 seemed to conclude that there is too little standardization of SMILES to support such search in a consistent standardized way, except possibly via SMARTS. Hence, there is a separate PR for adding SMARTS in #398 https://github.com/Materials-Consortia/OPTIMADE/pull/398.

I think any thoughts on how a standardized SMILES-based search could work are very welcome in the discussion in #398 https://github.com/Materials-Consortia/OPTIMADE/pull/398 ( #368 https://github.com/Materials-Consortia/OPTIMADE/issues/368 ). If I understand you correctly, you want the user to be able to give a SMILES string in "any format" and have the server internally handle the interpretation/conversion of that SMILES string to return entries whose SMILES is equivalent to the one given?

That's right. This is the standard procedure. What a service does is to run a very quick algorithm that transforms the queried SMILES to their canonical form (that is, using whatever software was used to create their saved SMILES strings). Then for them it is a straight string match. Very simple. Really nothing to it.

I query "CC1=CC=CC=C1O" and you convert that to "c1(C)ccccc1O" because that is how that is saved on your system. These are extremely simple algorithms -- just create the molecular graph from the SMILES and then generate the particular variant of SMILES from that that you need. It's just a quick pass through a library method.
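The canonicalize-then-match flow described above can be sketched roughly as follows. This is only an outline under stated assumptions: the function names and entry ID are hypothetical, and `toy_canonicalize` is a hard-coded lookup table standing in for whatever toolkit call the service actually uses to produce its own SMILES variant.

```python
from typing import Callable, Dict, List


def build_index(stored: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each stored (locally canonical) SMILES to the entry IDs carrying it."""
    index: Dict[str, List[str]] = {}
    for entry_id, smiles in stored.items():
        index.setdefault(smiles, []).append(entry_id)
    return index


def search(index: Dict[str, List[str]], query: str,
           canonicalize: Callable[[str], str]) -> List[str]:
    """Rewrite the query into the local canonical form, then do a plain lookup."""
    return index.get(canonicalize(query), [])


# ASSUMPTION: stand-in for a real toolkit. A service would parse the SMILES
# into a molecular graph and re-emit it; here three spellings of
# 2-methylphenol are simply mapped by hand to the stored variant.
_TOY_TABLE = {
    "CC1=CC=CC=C1O": "c1(C)ccccc1O",
    "Cc1ccccc1O": "c1(C)ccccc1O",
    "c1(C)ccccc1O": "c1(C)ccccc1O",
}


def toy_canonicalize(smiles: str) -> str:
    # Unknown strings pass through unchanged.
    return _TOY_TABLE.get(smiles, smiles)


index = build_index({"mol-1": "c1(C)ccccc1O"})
print(search(index, "CC1=CC=CC=C1O", toy_canonicalize))  # ['mol-1']
```

The string match itself stays trivial; all of the chemistry sits behind the `canonicalize` callable, which is the part a service chooses for itself.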

Is there a benefit of doing the search this way, instead of formulating it as a SMARTS search where the full structure is the substructure?

Yes, certainly. SMARTS searching would be fantastic, but this is a more specialized capability that takes more sophisticated cheminformatics tools to do efficiently. So it is less likely that a service would have that capability.

Consider the following four queries to PubChem:

[Four screenshots of PubChem queries, one per SMILES variant; images not preserved]

It would be much MUCH less useful if I had to already know that their canonicalization gave "CC1=CC=CC=C1O". How would I ever know that? Pretty sure they just did a quick conversion of those four SMILES variants to their "canonical" (meaning "the version our software creates") form and then, most probably, just did a straight string match. Milliseconds at most.

(From the technical side I don't think the right way to express this kind of search is an expression that looks exactly like a string equality comparison. However, that is a technical discussion I think can be sorted out once we know more precisely how we would want a useful standardized SMILES search to work.)

Sure. The key here is that there is no "standard" necessary. Every service chooses some particular toolkit to create SMILES strings. The "canonicalization" is with respect to the fact that, given a molecular graph, their software will always spit out the same SMILES string -- thus "locally" canonical. There is no such thing as "universally" canonical. Too many toolkits out there with their own idea of how to do this and what to call "aromatic" and how to represent that.

Bob

BobHanson avatar Aug 08 '22 11:08 BobHanson

I agree with Andrius. Just pointing out that the "." in SMILES may or may not indicate multiple components. It all depends on whether there are connecting links creating a bond between what is on the left of the period and what is on the right.

CCCO.O two components, one of them propanol, the other water

C1CCO.O1 one component, propane-1,3-diol
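The two cases above can be checked with a small component counter. This is a sketch for a restricted SMILES subset only (bracket atoms, organic-subset and aromatic atoms, branches, bond symbols, ring-closure labels, and the dot); the function name is made up for illustration, and it is not a full OpenSMILES parser.

```python
import re

# Restricted token set: bracket atoms, organic-subset atoms, aromatic atoms,
# ring-closure labels, branches, the dot, and bond symbols.
_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]|%\d{2}|\d|[().]|[-=#$:/\\]")
_BONDS = set("-=#$:/\\")


def count_components(smiles: str) -> int:
    """Count connected components, honouring ring closures across dots."""
    tokens = _TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unsupported SMILES syntax in {smiles!r}")

    parent = []  # union-find forest, one slot per atom

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    prev = None         # atom the next atom would bond to
    branch_stack = []   # atoms to return to when a branch closes
    open_rings = {}     # ring-closure label -> atom that opened it

    for tok in tokens:
        if tok == ".":
            prev = None  # no implicit bond across the dot
        elif tok == "(":
            branch_stack.append(prev)
        elif tok == ")":
            prev = branch_stack.pop()
        elif tok in _BONDS:
            continue
        elif tok[0].isdigit() or tok[0] == "%":
            if prev is None:
                raise ValueError("ring-closure label without a preceding atom")
            label = int(tok.lstrip("%"))
            if label in open_rings:
                parent[find(prev)] = find(open_rings.pop(label))
            else:
                open_rings[label] = prev
        else:  # an atom
            parent.append(len(parent))
            atom = len(parent) - 1
            if prev is not None:
                parent[find(prev)] = find(atom)
            prev = atom

    return len({find(i) for i in range(len(parent))})


print(count_components("CCCO.O"))    # 2
print(count_components("C1CCO.O1"))  # 1
```

The dot only suppresses the implicit bond to the previous atom; the ring-closure label `1` still bonds the final oxygen back to the first carbon.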

Bob

BobHanson avatar Aug 19 '22 17:08 BobHanson

Very simple. Really nothing to it. I query "CC1=CC=CC=C1O" and you convert that to "c1(C)ccccc1O" because that is how that is saved on your system. These are extremely simple algorithms -- just create the molecular graph from the SMILES and then generate the particular variant of SMILES from that that you need. It's just a quick pass through a library method.

I tend to disagree, conversion between different aromaticity depiction conventions is not straightforward. Richard L. Apodaca wrote a nice blogpost summarizing the issue and another one proposing an algorithm for conversion, which I tried to implement and gave up due to its complexity. There surely are libraries for this task, but there is no guarantee they correctly process various corner cases.

As for SMARTS, I am not aware of a single specification. Different libraries understand SMARTS queries quite differently. Time for OpenSMARTS? :sweat_smile:

Edit: There actually is a specification for OpenSMARTS!

merkys avatar Sep 23 '22 09:09 merkys

To me it seems like a tricky issue.

Just comparing SMILES as strings is of little use if the client and the server do not agree on canonical representation.

Reconstructing graph and then querying from it is doable but definitely more complicated than just passing a call to a database back-end.

InChI and InChI key are supposed to be more standard (and InChI keys must be queryable as strings, otherwise they make no sense...); but in our hands InChI conversion also gives artefacts. What about "Inchified SMILES"?

IMHO, to be useful for string searches, the SMILES string MUST be canonicalised in a reliable way, and this canonicalisation MUST be standard in OPTIMADE.

sauliusg avatar Jan 27 '23 15:01 sauliusg

InChI and InChI key are supposed to be more standard (and InChI keys must be queryable as strings, otherwise they make no sense...); but in our hands InChI conversion also gives artefacts. What about "Inchified SMILES"?

To my knowledge, Inchified SMILES is only implemented in Open Babel. Thus putting Inchified SMILES in the standard would likely push towards unified usage of Open Babel, and likely tie to one particular version of it.

In addition, I would personally like to avoid InChI, as recent versions of InChI library are not free software, at least not as understood by the Debian Free Software Guidelines.

merkys avatar Jan 27 '23 15:01 merkys

IMHO:

  1. SMILES are fundamentally valuable with or without canonicalization.
  2. The extent to which a SMILES is valuable depends upon the context.
  3. One context is structure matching.
  4. Another context is substructure searching.
  5. Another context is 2D- or 3D-structure creation from 1D representation (SMILES or InChI, name, etc.)

Agreed? Probably more options.

For structure matching, canonicalization is primarily valuable within a local context, because canonicalization only means that the particular algorithm used guarantees that regardless of how the structure's atoms and bonds are organized, the same string will be created -- provided that the same input options have been used (and there are many options!). And generally only within a local context do we know what exact algorithm was used and what options were used with it.

Furthermore, algorithms and implementations of algorithms are prone to multiple versioning. So one can never require any specifics regarding SMILES. Just to say, for example, "InChIfied SMILES" is not nearly enough. What version? What options? Would I somehow track down some old version and use it? Probably not.

It's a classic rat's nest.

So, I am not in favor of anything more than "smiles" here. It is a very narrow use-case where we need to know exactly what algorithm+options were used. If people feel that is necessary, then I suggest we follow the lead of PubChem (see their entry for 1,2-dimethylbenzene) and allow for a second field that indicates at least something about the algorithm and options used:

InChI=1S/C8H10/c1-7-5-3-4-6-8(7)2/h3-6H,1-2H3 Computed by InChI 1.0.6 (PubChem release 2021.05.07)

CC1=CC=CC=C1C Computed by OEChem 2.3.0 (PubChem release 2021.05.07)

(Interesting that they do not indicate the options there -- Here we see a Kekulé form of the SMILES, but we could have also seen Cc1ccccc1C, so perhaps the "canonical" OEChem option requires that. Probably. Maybe. Or it was an option.)

Just to make the point, if we go to ChEMBL, alas, we find that for them, the "Canonical SMILES" is, in fact, Cc1ccccc1C.

ChEMBL does not specify what algorithm+options were used.
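A "second field" of this kind could look something like the sketch below. The class and field names (`SmilesWithProvenance`, `computed_by`) are hypothetical and not part of any OPTIMADE proposal; the values are the PubChem and ChEMBL strings quoted above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SmilesWithProvenance:
    smiles: str
    # Hypothetical free-text provenance, e.g. toolkit name and version.
    computed_by: Optional[str] = None


pubchem_like = SmilesWithProvenance(
    "CC1=CC=CC=C1C", "OEChem 2.3.0 (PubChem release 2021.05.07)"
)
chembl_like = SmilesWithProvenance("Cc1ccccc1C")  # no provenance given


def directly_comparable(a: SmilesWithProvenance, b: SmilesWithProvenance) -> bool:
    """Raw string comparison is only meaningful when both values are known
    to come from the same tool (and, ideally, the same options)."""
    return a.computed_by is not None and a.computed_by == b.computed_by


print(directly_comparable(pubchem_like, chembl_like))  # False
```

Even then, a free-text note like this says nothing about the options used, so it narrows the ambiguity rather than removing it.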

My personal preference is noncanonical Kekulé SMILES, which is the basis for SMILES searching targets (actual molecules), rather than aromatic SMILES, which are more useful for the pattern used to find the target, since it covers multiple Kekulé varieties.

Saulius, I'm guessing that at COD, when I type in a SMILES string, you immediately canonicalize it to match your database, right? Or do you just consider everything entered to be a SMARTS search?

Bob

BobHanson avatar Jan 27 '23 16:01 BobHanson