Add SMILES property

Open JPBergsma opened this issue 3 years ago • 59 comments

Do we want to allow the use of SMILES strings in the field chemical_formula_descriptive? The SMILES notation for molecular formulas uses '#' and '$' to indicate triple and quadruple bonds, the characters '/' and '\' to indicate whether the bonds are in the cis or trans orientation, and '@' and '@@' to differentiate enantiomers. Finally, ring numbers with more than one digit have to be preceded by a '%' sign.
It, therefore, seems reasonable to me to add these to the allowed characters for the chemical_formula_descriptive field.

Or do you think we should add a separate SMILES field instead?

JPBergsma avatar Jul 05 '21 13:07 JPBergsma

Or do you think we should add a separate SMILES field instead?

I would suggest so. chemical_formula_descriptive has its own purpose and semantics, and they should not change.

merkys avatar Jul 06 '21 09:07 merkys

@JPBergsma the topic of SMILES has come up a few times and a standardization for SMILES use in OPTIMADE would likely be very useful. If you are familiar with SMILES usage, could you perhaps describe a few "search scenarios" of SMILES data? E.g., what would you be searching for? How do you envision such a search could be expressed, etc.?

rartino avatar Jul 06 '21 10:07 rartino

Sorry, I did not read the specification for chemical_formula_descriptive well enough the first time and I overlooked that it is already defined by the IUPAC's Nomenclature. I, therefore, had already closed the issue but unfortunately, I did not have sufficient privileges to remove it.

It would indeed be better to add a separate field for the SMILES string, although we could also think about other ways to add topological information, as SMILES strings cannot be compared directly.

JPBergsma avatar Jul 06 '21 10:07 JPBergsma

(I took the liberty of editing your issue title to match - feel free to adjust it)

rartino avatar Jul 06 '21 10:07 rartino

First of all, defining the topology of a molecule allows you to distinguish between molecules with the same elemental composition but a different structure. Perhaps the current IUPAC definition is also able to do so, but via the link in optimade.rst https://www.qmul.ac.uk/sbcs/iupac/bibliog/blue.html I only found information about how to name chemical compounds and not how to write the structural formula. (IUPAC did define the InChI format which does contain the molecular structure, but that is different from the example fields in OPTIMADE.)

Ideally, having the structural data of a molecule would also allow you to find molecules with a mostly similar structure but some small differences. For example, a structure where a hydrogen atom has been replaced by a methyl group or a bromine atom has been replaced by a chlorine atom. While this would be quite useful, it may be difficult to implement such a search.

I am not sure whether SMILES is the best option for this. It has the advantage that the strings are relatively human-readable but multiple SMILES strings can encode for the same molecule. So you first have to convert the string to a structure before you know whether they are identical, or you have to agree on which algorithm to use to generate SMILES strings.

There are other ways to store the structure of a molecule, like InChI, and another option would be to use a connectivity matrix.

JPBergsma avatar Jul 06 '21 14:07 JPBergsma

During OMDI I talked with someone from the Ocelot database. Their database has crystal structures of organic molecules. They use SMILES strings to search for and select structures, as one structure can have many names and a simple structural formula is not descriptive enough. So I think there would definitely be a use for a SMILES field within OPTIMADE. In the original SMILES notation, multiple strings can encode the same molecule. Therefore they first convert the string to a structure and then convert it back to a SMILES string with a known algorithm, so the SMILES strings are guaranteed to be the same. They also match chemical groups; for example, when I searched for benzene, they also returned molecules containing a benzene ring. They have a git repository, so perhaps we could reuse some of their code to implement this in the OPTIMADE Python tools.

JPBergsma avatar Oct 17 '21 15:10 JPBergsma

I support standardizing a separate property for SMILES. However, there are some issues related both to its definition and usability.

  1. There is a bunch of competing SMILES specifications. I like the OpenSMILES as it is quite well-defined, albeit somewhat limited and unmaintained. Competing specifications mean that different software suites usually support one or another specification, but usually without clearly stating which one.
  2. The same molecule can yield different SMILES. Canonicalization algorithms exist, but again there are many, without a prevalent one.
  3. SMILES matching is not string matching. While identical SMILES almost always mean identical molecules, this is pretty much the only comparison one can do with plain strings. There are tools like Mychem which implement substructure search using SMILES strings in MySQL, but the general SMILES comparison usually boils down to subgraph isomorphism. Fingerprinting techniques are a viable alternative.
  4. SMILES are directed mostly at organics. Therefore, compounds beyond organics are not trivial to represent, resulting in the need for additional conventions on representing them. We have contributed to an article about that, Quirós et al. 2018.

InChI is an alternative representation, however, it does not solve the matching problem. Moreover, it has licensing issues impeding its convenient usage.

merkys avatar Oct 18 '21 10:10 merkys

There is also the question of how we handle this type of extension into string-like complex properties in the OPTIMADE filter language (and otherwise in our type system). Far back I wrote up my thoughts on this here: https://github.com/Materials-Consortia/OPTIMADE/issues/157#issuecomment-554686285

But, in short, we probably need to have some way to tell a normal string and a smiles string apart since they will have different comparison semantics.

rartino avatar Oct 18 '21 11:10 rartino

@Merkys

There is a bunch of competing SMILES specifications. I like the OpenSMILES as it is quite well-defined, albeit somewhat limited and unmaintained. Competing specifications mean that different software suites usually support one or another specification, but usually without clearly stating which one.

1 The OpenSmiles standard is definitely an option. It seems practically the same as the SMILES definition on the Daylight website so if necessary we could switch. Ideally, we would also use the SMARTS extension, which is specifically focused on querying structures, although it is not included in the OpenSmiles standard.

The same molecule can yield different SMILES. Canonicalization algorithms exist, but again there are many, without a prevalent one.

2 Either the server would have to canonicalize the input from the client or we would have to agree on a canonicalization algorithm that should be used by all clients and servers. At the moment I prefer canonicalization by the server as this does not put canonicalization requirements on the client and the server would need to do some processing anyway to handle queries using SMARTS. Internally the server may also store structure information in a different format than SMILES so it would need to do a conversion anyway. Another question would be whether we want to canonicalize the output.

SMILES matching is not string matching. While identical SMILES almost always mean identical molecules, this is pretty much the only comparison one can do with plain strings. There are tools like Mychem which implement substructure search using SMILES strings in MySQL, but the general SMILES comparison usually boils down to subgraph isomorphism. Fingerprinting techniques are a viable alternative.

3 I think it will indeed be necessary to generate a molecular graph. Although a preselection could be made using fingerprinting, for example, by looking at the atom composition of the searched fragment, or by comparing which common structural elements are present.
This way the full structures would only need to be compared for a relatively small number of structures.
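The preselection idea can be sketched in a few lines of Python. This is only an illustration under strong assumptions: the helper names `element_counts` and `may_contain` are made up here, and the tokenizer only understands bracket-free SMILES with organic-subset atoms. It checks a necessary condition (the candidate has at least as many atoms of each element as the fragment), so it can rule structures out but never confirms a match.

```python
import re
from collections import Counter

# Toy tokenizer: two-letter halogens first, then single-letter
# organic-subset atoms (upper case = aliphatic, lower case = aromatic).
# Bracket atoms such as [OH2] or [Fe+2] are NOT handled in this sketch.
_ATOM = re.compile(r"Cl|Br|[BCNOPSFI]|[bcnops]")


def element_counts(smiles: str) -> Counter:
    """Count heavy atoms per element in a bracket-free SMILES string."""
    return Counter(tok.capitalize() for tok in _ATOM.findall(smiles))


def may_contain(candidate: str, fragment: str) -> bool:
    """Cheap necessary condition for a substructure match.

    True means "worth a full graph comparison"; False means the
    fragment cannot be present in the candidate.
    """
    have = element_counts(candidate)
    need = element_counts(fragment)
    return all(have[element] >= n for element, n in need.items())


# Toluene could contain a benzene ring; ethanol cannot.
print(may_contain("Cc1ccccc1", "c1ccccc1"))
print(may_contain("CCO", "c1ccccc1"))
```

Only the structures that pass such a filter would need the expensive subgraph comparison.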

SMILES are directed mostly at organics. Therefore, compounds beyond organics are not trivial to represent, resulting in the need for additional conventions on representing them. We have contributed to an article about that, Quirós et al. 2018.

4 At first I was thinking about limiting the requirement for SMILES structures to organic compounds, but after reading your article we could perhaps expand the SMILES definition to a broader range of compounds. In that case, we should formalize the method further than is currently described in the article. There may still be some arbitrariness with describing the atomistic structures though, as some arbitrary cut-off point has to be chosen for defining a bond.

InChI is an alternative representation, however, it does not solve the matching problem. Moreover, it has licensing issues impeding its convenient usage.

5 It seems that the discussion about the InChI licensing issue you refer to is still ongoing, so perhaps it will be resolved. I do not think using InChI for our database would go against the intention of the InChI Trust.

Standard InChI has the limitation that tautomers have the same InChI code. In a laboratory setting, it is usually not possible to separate the tautomers so this would not be a problem. But in computational chemistry, the timescales are usually so short that no conversion takes place. There is an extension for this so I think we should implement it if we would want to use InChI. That way each InChI should belong to exactly one structure. Personally, I find InChI less intuitive and human-readable than SMILES, so simply typing in an InChI code would be more difficult than with SMILES.

A final option would be to use a molecular graph for searching.

@rartino

Unless we decide on a canonicalization algorithm, the SMILES field should indeed not have the string type as a direct comparison of uncanonicalized SMILES strings is not possible.

JPBergsma avatar Oct 21 '21 16:10 JPBergsma

(For brevity, I am not citing and explicitly responding to @JPBergsma's sentences with which I completely agree)

Ideally, we would also use the SMARTS extension, which is specifically focused on querying structures, although it is not included in the OpenSmiles standard.

This can already be implemented by using custom extension endpoint mechanism.

2 Either the server would have to canonicalize the input from the client or we would have to agree on a canonicalization algorithm that should be used by all clients and servers. At the moment I prefer canonicalization by the server as this does not put canonicalization requirements on the client

Yes, this makes sense.

and the server would need to do some processing anyway to handle queries using SMARTS.

Not necessarily. The server, for example, may just pass user input to Open Babel which either reconstructs molecular graphs or does fingerprint matching.

Another question would be whether we want to canonicalize the output.

Preferably yes.

4 At first I was thinking about limiting the requirement for SMILES structures to organic compounds, but after reading your article we could perhaps expand The SMILES definition to a broader range of compounds. In that case, we should formalize the method further than is currently described in the article. There may still be some arbitrariness with describing the atomistic structures though, as some arbitrary cut-off point has to be chosen for defining a bond.

This would be nice, but again, all providers should use conventions as similar as possible.

5 It seems that the discussion about the InChI licensing issue, you refer to, is still ongoing so perhaps it will be resolved. I do not think using InChI for our database would go against the intention of the InChI Trust.

Strictly speaking, this is true only if providers manage to use the InChI library without modifying its code.

merkys avatar Oct 22 '21 12:10 merkys

Ideally, we would also use the SMARTS extension, which is specifically focused on querying structures, although it is not included in the OpenSmiles standard.

This can already be implemented by using custom extension endpoint mechanism.

I am not sure what you mean with custom extension endpoint mechanism. There is a custom extension endpoint in the Optimade standard, but I do not see why that would be relevant here. You would want to use the SMARTS/SMILES to find particular structures. Creating a separate endpoint to do this seems cumbersome. You would also want to standardize the way this works across multiple databases. Which would be difficult if each database would create a custom endpoint.

the server would need to do some processing anyway to handle queries using SMARTS. Not necessarily. The server, for example, may just pass user input to Open Babel which either reconstructs molecular graphs or does fingerprint matching.

In that case the SMILES string would still be processed on the server (as in the physical computer that deals with the request).

JPBergsma avatar Oct 26 '21 20:10 JPBergsma

Ideally, we would also use the SMARTS extension, which is specifically focused on querying structures, although it is not included in the OpenSmiles standard.

This can already be implemented by using custom extension endpoint mechanism.

I am not sure what you mean with custom extension endpoint mechanism. There is a custom extension endpoint in the Optimade standard, but I do not see why that would be relevant here. You would want to use the SMARTS/SMILES to find particular structures. Creating a separate endpoint to do this seems cumbersome. You would also want to standardize the way this works across multiple databases. Which would be difficult if each database would create a custom endpoint.

Sorry, I misparsed the term "extension".

I believe the SMARTS were originally described by Daylight. I am not sure about the state of other parallel SMARTS specifications, though.

the server would need to do some processing anyway to handle queries using SMARTS. Not necessarily. The server, for example, may just pass user input to Open Babel which either reconstructs molecular graphs or does fingerprint matching.

In that case the SMILES string would still be processed on the server (as in the physical computer that deals with the request).

Yes, that is true.

merkys avatar Nov 26 '21 14:11 merkys

Looking back at my discussion checklist, I think we at least agree on using OpenSMILES. However, other issues still need more discussion. My suggestions to speed up the introduction of SMILES property would be the following:

  1. Server-provided SMILES need not be canonical. Since there are many canonicalization methods and we probably cannot select one from them all, servers should just provide any SMILES representation of a structure. Then it is up to the client to canonicalize them or not.
  2. Comparisons of SMILES with other SMILES or strings must not be supported, as well as querying. We may introduce this support later.

This would make the SMILES property a descriptive one. Thus, the client will be able to retrieve SMILES values alongside other structural data, but would not be able to query on them.

For dealing with inorganics I could propose adhering to Quirós et al. 2018 (disclaimer: I am one of the authors), but this would not be convenient for providers using their own conventions, or producing SMILES by Open Babel or some other software.

merkys avatar Nov 26 '21 15:11 merkys

I agree on point 1, that databases are allowed to use their own canonicalization method.

Part of the reason to implement this though is to make it easier to search for organic molecules, as these can have the same chemical formula. For that to work, it should be possible to search for SMILES strings. This should not be that difficult to implement. The database provider can turn the SMILES string of the query into a structure and turn it back into a smiles string with the canonicalization method of choice. The generated SMILES string can then be used for a simple string comparison with the SMILES fields in the database. Searching for fragments can still be added later on if necessary.

Quirós et al. 2018 could indeed be useful for describing metal complexes and such, as far as that they are not covered by the OpenSMILES standard.

JPBergsma avatar Dec 02 '21 17:12 JPBergsma

Aren't we landing in that we should just standardize a SMILES field to be a normal OPTIMADE String which is specified to contain an OpenSMILES representation of the implementer's choice (much like chemical_formula_descriptive, which had similar normalization issues with competing standards), and then put the requirement on MUST or SHOULD level that all partial string matching filter operators are supported?

(I realize it was said above that it cannot be a String because uncanonicalized SMILES "cannot be compared", but, the same issue technically holds for chemical_formula_descriptive and we were ok with that...)

The database provider can turn the SMILES string of the query into a structure and turn it back into a smiles string with the canonicalization method of choice.

I'm not sure why you mean such conversions would be needed (?), but if so, then this query support can only be on MAY level since it goes far beyond what can be handled efficiently by a typical query layer.

rartino avatar Dec 03 '21 06:12 rartino

I agree that we can define SMILES as a regular OPTIMADE String with all string handling operations. Thus for the time being "O" != "[OH2]" is true as these strings are not equal, despite molecules with SMILES of O and [OH2] being actually the same.

So it seems we have consensus on most of the SMILES-related issues. Let us prepare a PR then? I have opened #392 from the consensus (IMO) we achieved here.

merkys avatar Dec 03 '21 09:12 merkys

If we define the SMILES field as a normal OPTIMADE string we should define the canonicalization method that should be used with OPTIMADE. Otherwise, it does not make sense to put the requirement on MUST or SHOULD level for the (partial) string matching filter operators, as one molecule can have multiple different SMILES strings.

One of the main reasons to implement the SMILES notation is to enable searching on molecular structures. Without this, sharing data on structures composed of individual molecules would be inefficient. More structures would need to be returned than needed, since you can only select on the chemical formula and many molecules can have the same chemical formula. I can imagine that for people who want to set up a database with molecular structures, not being able to search for molecules could be a reason to not use OPTIMADE.

 I'm not sure why you mean such conversions would be needed (?), but if so, then this query support can only be on MAY level since it goes far beyond what can be handled efficiently by a typical query layer.

The conversion would be needed if we do not agree on a canonicalization method. If you start generating the SMILES string from different atoms within a molecule, you would get a valid SMILES string for each starting atom, but they would all be different. Because of this, you can not do a simple string comparison to see if two SMILES strings refer to the same molecule. So you would first need to generate the structure from the SMILES string and then turn it back into a SMILES string with the same method that has been used to generate the SMILES strings in the database.

There are already Python packages that can convert SMILES strings into structures and back. RDKit can do this, and it also guarantees the created SMILES string is canonicalized, i.e. you will always get the same string regardless of which SMILES string you originally used.

A simple way to make your structures with SMILES strings searchable is to convert your SMILES into structures and then back into SMILES strings with RDKit. This way, you can be sure all strings have the same canonicalization method. If you do the same for any SMILES string that is entered as a search term, it is guaranteed that two structures are the same if the SMILES strings match and are different when they do not match. This means a simple string comparison, which most database backends should be able to do quickly, is sufficient to find identical molecules.
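A minimal sketch of this "canonicalize on ingest, canonicalize on query" idea, using only the Python standard library. The names `toy_canonical` and `SmilesIndex` are hypothetical, and `toy_canonical` is a deliberately tiny stand-in that only handles unbranched, single-bonded chains of organic-subset atoms. A real server would plug in a cheminformatics toolkit such as RDKit or Open Babel in its place.

```python
import re
from typing import Callable, Dict, List

_ATOM = re.compile(r"Cl|Br|[BCNOPSFI]")


def toy_canonical(smiles: str) -> str:
    """Toy canonicalizer for unbranched, single-bonded chains only.

    Such a chain can be written starting from either end, so pick the
    lexicographically smaller of the two atom sequences. This is a
    stand-in, NOT a real SMILES canonicalization algorithm.
    """
    atoms = _ATOM.findall(smiles)
    if "".join(atoms) != smiles:
        raise ValueError(f"unsupported SMILES for this toy: {smiles!r}")
    return "".join(min(atoms, atoms[::-1]))


class SmilesIndex:
    """Exact-molecule lookup via plain string equality on canonical keys."""

    def __init__(self, canonicalize: Callable[[str], str]) -> None:
        self._canonicalize = canonicalize
        self._ids_by_smiles: Dict[str, List[str]] = {}

    def add(self, entry_id: str, smiles: str) -> None:
        # Canonicalize once, when the entry is stored.
        key = self._canonicalize(smiles)
        self._ids_by_smiles.setdefault(key, []).append(entry_id)

    def find(self, query_smiles: str) -> List[str]:
        # Canonicalize the search term the same way, then compare strings.
        return self._ids_by_smiles.get(self._canonicalize(query_smiles), [])


index = SmilesIndex(toy_canonical)
index.add("s1", "OCC")   # ethanol, written from the oxygen end
index.add("s2", "CCO")   # ethanol, written from the carbon end
index.add("s3", "CCC")   # propane
print(index.find("OCC"))
```

The point of the design is that clients never need to know which canonicalization the server uses; both the stored values and the query pass through the same function before the string comparison.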

One issue that we have not yet discussed is how we are going to handle structures with multiple molecules. Within a normal SMILES string these molecules are separated by ".". This would however require partial string matching to find the molecules. I suspect that this is relatively inefficient for databases, so I think it would be better to implement this as a list.
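To illustrate the difference between substring matching and list membership, here is a small sketch. The function names are made up for this example, and the split is naive: OpenSMILES allows a ring-closure digit to bond atoms across a dot (for example "C1.C1" is ethane), which this code does not account for.

```python
from typing import List


def split_components(smiles: str) -> List[str]:
    """Split a SMILES string on '.' into its written components.

    Caveat: a ring-closure digit may connect atoms across a dot, so the
    parts are not always separate molecules. This sketch ignores that.
    """
    return [part for part in smiles.split(".") if part]


def has_component(smiles: str, molecule: str) -> bool:
    """List-style membership test: is `molecule` one of the components?"""
    return molecule in split_components(smiles)


print(split_components("[Na+].[Cl-]"))
# Substring matching finds an "O" inside ethanol, list membership does not.
print("O" in "CCO", has_component("CCO", "O"))
print(has_component("O.CCO", "O"))
```

Storing the components as a list lets the backend use exact equality per element instead of scanning inside strings.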

JPBergsma avatar Dec 05 '21 12:12 JPBergsma

I agree that to implement reliable querying of exact structures we have to define a canonicalization method. This will most likely boil down to choosing a common software package to produce canonical SMILES for OPTIMADE output, be it RDKit, Open Babel or something else. In addition, if we want to support inorganics, all providers will have to select a common set of rules to describe them.

As for reliable partial molecular matching, IMO we will never get by with simple substring matching. Imagine for example patterns to match rings.

Here I would like to draw attention to the distinction between database querying and screening. The first one expects the database to perform entry selection, whereas the second one downloads whole database and performs entry selection locally. I do not believe it is feasible to push all the providers to implement exact querying mechanisms. Thus IMO it is better to provide descriptive data in some common format and let the users perform the screening. With OPTIMADE provisions to include only specific fields in the response, downloads should not be too large.

Thus I very much would want to avoid forcing all the providers to use the same canonicalization method. I am afraid that instead of being a useful descriptive property, SMILES would be supported by only a few providers.

merkys avatar Dec 05 '21 13:12 merkys

One issue that we have not yet discussed is how we are going to handle structures with multiple molecules. Within a normal SMILES string these molecules are separated by ".". This would however require partial string matching to find the molecules. I suspect that this is relatively inefficient for databases, so I think it would be better to implement this as a list.

Right. I would prefer sticking to string, not list because of how I imagine SMILES property to be used (screening instead of querying). In addition to that, the only list member comparison operator for string is equality (i.e., smiles HAS "O", would match water molecules). Others (CONTAINS, STARTS WITH, ENDS WITH) are not supported even on grammar level.

merkys avatar Dec 05 '21 13:12 merkys

As for reliable partial molecular matching, IMO we will never get by with simple substring matching. Imagine for example patterns to match rings.

Indeed, matching substructures is much more complicated and beyond the scope of PR#392.

Here I would like to draw attention to the distinction between database querying and screening. The first one expects the database to perform entry selection, whereas the second one downloads whole database and performs entry selection locally. I do not believe it is feasible to push all the providers to implement exact querying mechanisms. Thus IMO it is better to provide descriptive data in some common format and let the users perform the screening. With OPTIMADE provisions to include only specific fields in the response, downloads should not be too large.

Screening would be less efficient for both the client and the server. The database would have to send the SMILES strings of many structures to the client (some preselection can be made based on the elements in the SMILES string/molecule). Then the client would have to convert all these SMILES strings to structures so that they can be compared with the molecular structure that the client is searching for. Once the SMILES strings that encode the desired molecule have been found, the client would have to send a query to the database for the records with these SMILES strings. The database would then have to loop over all SMILES values to check which contain these SMILES strings before returning the desired structures. This takes much more computing time than the method I suggested. I am therefore convinced that we should not force databases to use the screening method you described.

Right. I would prefer sticking to string, not list because of how I imagine SMILES property to be used (screening instead of querying). In addition to that, the only list member comparison operator for string is equality (i.e., smiles HAS "O", would match water molecules). Others (CONTAINS, STARTS WITH, ENDS WITH) are not supported even on grammar level.

There are not many useful substring queries you can do on SMILES strings. You could check whether triple and quadruple bonds or charges are present, but that's about it. So we would not lose that much by converting the field to a list. And it would of course also be possible to expand the queryability of strings in a list, although that's best left for a different PR.

JPBergsma avatar Dec 05 '21 19:12 JPBergsma

Screening would be less efficient for both the client and the server. The database would have to send the SMILES strings of many structures to the client (some preselection can be made based on the elements in the SMILES string/molecule). Then the client would have to convert all these SMILES strings to structures so that they can be compared with the molecular structure that the client is searching for. Once the SMILES strings that encode the desired molecule have been found, the client would have to send a query to the database for the records with these SMILES strings. The database would then have to loop over all SMILES values to check which contain these SMILES strings before returning the desired structures. This takes much more computing time than the method I suggested. I am therefore convinced that we should not force databases to use the screening method you described.

In my understanding screening is simpler. A generic screening workflow:

  1. Client retrieves all information required for screening;
  2. Client performs screening locally to find entry IDs of interest;
  3. Using entry IDs client retrieves full entry records from the database.
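The three steps above can be sketched as follows. Everything specific here is an assumption made for illustration: the base URL is hypothetical, `smiles` is an assumed property name, the records are made up, and the screening predicate is a plain string check standing in for a real molecular comparison done with a cheminformatics toolkit. No network request is made; the code only builds the URLs.

```python
from typing import Callable, Dict, Iterable, List
from urllib.parse import urlencode

# Hypothetical provider URL; "smiles" is an assumed property name.
BASE_URL = "https://example.org/optimade/v1/structures"


def screening_url() -> str:
    """Step 1: request only the fields needed for screening."""
    return BASE_URL + "?" + urlencode({"response_fields": "id,smiles"})


def screen(entries: Iterable[Dict], keep: Callable[[str], bool]) -> List[str]:
    """Step 2: select entry IDs locally from the downloaded records."""
    return [e["id"] for e in entries if keep(e["attributes"]["smiles"])]


def fetch_url(ids: Iterable[str]) -> str:
    """Step 3: build a filter that retrieves the selected entries by ID."""
    flt = " OR ".join(f'id="{i}"' for i in ids)
    return BASE_URL + "?" + urlencode({"filter": flt})


# Made-up records in the JSON:API-like shape OPTIMADE responses use.
entries = [
    {"id": "mol-1", "attributes": {"smiles": "CCO"}},
    {"id": "mol-2", "attributes": {"smiles": "c1ccccc1"}},
    {"id": "mol-3", "attributes": {"smiles": "OCC"}},
]
# Stand-in predicate; a real client would compare molecular graphs
# instead of comparing strings.
wanted = {"CCO", "OCC"}
selected = screen(entries, lambda s: s in wanted)
print(screening_url())
print(fetch_url(selected))
```

Note that the final query filters on `id` only, so the server never has to interpret SMILES values.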

Thus for SMILES there is no need to query the database on SMILES values, ever. As for converting SMILES to structures locally, to perform the screening locally a client most likely will use RDKit or Open Babel or any other cheminformatics toolbox.

I agree that this is more computing time than just storing canonicalized SMILES in provider databases. However, all providers have to agree on the same canonicalization method and this has to be enforced on MUST level (otherwise it cannot be trusted). And I do not believe this is feasible.

As I have written before, there are many SMILES canonicalization methods. However, they are rarely well-defined. I am not in favor of writing "SMILES canonicalization MUST be done by RDKit" in the specification, because we will have to put down the specific RDKit version (other versions may change the canonicalization), even specific versions of its dependencies if we are interested in providing really reliable service. And this, I believe, opens yet another can of worms. So unless we find a well-defined SMILES canonicalization method supported by more than one (ideally, >2) cheminformatics toolboxes, I do not think we can enforce it.

There are not many useful substring queries you can do on SMILES strings. You could check whether triple and quadruple bonds or charges are present, but that's about it. So we would not lose that much by converting the field to a list. And it would of course also be possible to expand the queryability of strings in a list, although that's best left for a different PR.

I agree that substring comparisons are not very useful indeed. I have opened issue #393 to discuss the expansion of the queryability of strings in a list. But I would stick to single-string SMILES representation due to its simplicity unless we mandate strict canonicalization.

It would be great to hear the opinions of other developers interested in this property.

merkys avatar Dec 06 '21 08:12 merkys

It indeed seems a problem for the canonicalization approach, if there isn't any good standard to use for canonicalization.

But, I also find it quite abstract what kind of "high level searches" we are talking about here that are connected to the SMILES field specifically, as opposed to our other structural fields.

@JPBergsma could you try to come up with a few examples of "dream" searches that you envision possible if one does the on-the-fly conversion from SMILES to structure that you propose? Feel free to just improvise a filter syntax.

rartino avatar Dec 06 '21 14:12 rartino

But, I also find it quite abstract what kind of "high level searches" we are talking about here that are connected to the SMILES field specifically, as opposed to our other structural fields.

I understand that currently we are mostly talking about identical match operation. If the canonicalization becomes a MUST, then this reduces to simple string comparison (= and != operators).

Early in the discussion fuzzy matching was discussed. There is the SMARTS query language, which could be employed to search for substructures, for example:

  • smiles CONTAINS SMARTS "c1ccccc1" could be used to find structures having benzene rings (CONTAINS SMARTS is a "dream operator" here)
  • smiles SMARTS "c1ccccc1" could be used to find structures that are exactly benzene rings (SMARTS being "dream operator"). Not sure whether SMARTS language has provisions for exact match, though.

merkys avatar Dec 06 '21 15:12 merkys

  1. There is a bunch of competing SMILES specifications. I like the OpenSMILES as it is quite well-defined, albeit somewhat limited and unmaintained. Competing specifications mean that different software suites usually support one or another specification, but usually without clearly stating which one.

There is an interesting new development called Dialect. It is an attempt to fix and extend the commonly used SMILES standard. I am not suggesting to switch to it right away as it is in its early stages of development, just linking for reference.

merkys avatar Dec 26 '21 07:12 merkys

In my understanding screening is simpler. A generic screening workflow:

Client retrieves all information required for screening; Client performs screening locally to find entry IDs of interest; Using entry IDs client retrieves full entry records from the database.

That is also possible. It however does require sending more information (the IDs) to the client. If the SMILES field is indexed, it would not take extra time to use the SMILES strings instead.

I agree that this is more computing time than just storing canonicalized SMILES in provider databases. However, all providers have to agree on the same canonicalization method and this has to be enforced on MUST level (otherwise it cannot be trusted). And I do not believe this is feasible.

If we treat the SMILES string as a plain string, we would indeed need to agree upon a canonicalization method. This would be the most efficient. We could however let the database convert the SMILES string, that is entered in the search, to a SMILES string in the canonicalization format of the database. That way, the database provider would only have to compare strings, and there does not need to be agreement on the canonicalization method.

I quickly looked, but I could not find which exact method RDKit uses. It is a shame that there is no widely adopted canonicalization method, even though a canonicalization method was already defined with the original SMILES standard.

Dialect is a nice initiative, but I am a bit worried that we would get just another standard that is not widely adopted. There is, for example, already SYBYL, another SMILES-derived format for specifying chemical structures. However, it does not define a canonicalization method, so it would not solve our problem.

@rartino As Merkys mentioned, you mostly want to find structures with specific molecules; some libraries like RDKit will also generate tautomers for a structure. Finding molecules with a certain substructure would be great, but I do not think this can be done efficiently with the backends that are currently used (SQL, MongoDB, Elasticsearch).

JPBergsma avatar Jan 02 '22 13:01 JPBergsma

In my understanding screening is simpler. A generic screening workflow:

  1. Client retrieves all information required for screening.
  2. Client performs screening locally to find entry IDs of interest.
  3. Using entry IDs, client retrieves full entry records from the database.

That is also possible. It does, however, require sending more information (the IDs) to the client. If the SMILES field is indexed, it would not take extra time to use the SMILES strings instead.

I do not see a problem in retrieving IDs. Most of the time clients will want IDs and versions/modification timestamps for provenance anyway.

I agree that this is more computing time than just storing canonicalized SMILES in provider databases. However, all providers have to agree on the same canonicalization method and this has to be enforced on MUST level (otherwise it cannot be trusted). And I do not believe this is feasible.

If we treat the SMILES string as a plain string, we would indeed need to agree upon a canonicalization method. This would be the most efficient. We could, however, let the database convert the SMILES string entered in the search into a SMILES string in the canonicalization format of the database. That way, the database provider would only have to compare strings, and there would be no need to agree on the canonicalization method.

Right, but canonicalization methods will affect matching. A trivial example is aromatized vs. kekulized aromatic rings. If the provider does not canonicalize these, then kekulized input will only match kekulized molecules in the database.

I quickly looked, but I could not find which exact method RDKit uses. It is a shame that there is no widely adopted canonicalization method, even though a canonicalization method was already defined with the original SMILES standard.

AFAIR, this method has many deficiencies. Maybe this is the reason it has not been adopted widely.

Dialect is a nice initiative, but I am a bit worried that we would get just another standard that is not widely adopted. There is, for example, already SYBYL, another SMILES-derived format for specifying chemical structures. However, it does not define a canonicalization method, so it would not solve our problem.

Sure, but I like the idea. Dialect does not seem to aim to reinvent a SMILES-like notation or extend it, but rather to clarify the obscure parts which are often interpreted differently.

merkys avatar Jan 03 '22 08:01 merkys

Introduced _cod_smiles in the COD OPTIMADE implementation. It is a plain string, just as suggested in #392. String-based queries are not implemented yet, though.

merkys avatar Jan 03 '22 15:01 merkys

@JPBergsma

We could, however, let the database convert the SMILES string entered in the search into a SMILES string in the canonicalization format of the database. That way, the database provider would only have to compare strings, and there would be no need to agree on the canonicalization method.

If we can do canonicalization, this is indeed the design to go for. (Enabling this kind of cheap on-the-fly translation plus an optimized backend query has been a guiding principle for other fields, which is the reason we do not enforce support for partial string matching on such fields.)

However, in the absence of a good explicitly formulated canonicalization, I have trouble seeing a solution beyond the chemical_formula_descriptive approach where each database does what makes the most sense to them.

Nevertheless, if the dream is to query on substructures, maybe this can be done in another way than as a quasi-string-operation on a single SMILES field? Could we have something like an optional SMILES_substructures which is a list of all identifiable substructures? It could then be queried like: SMILES_substructures HAS "c1ccccc1".

(Since the implementation knows the specific (quasi-)canonicalization used by the backend, it may be able to translate this query to a partial string matching on the backend SMILES field.)
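To make the proposal concrete, here is a rough sketch of how a client might send such a filter and how the HAS test behaves on a list-valued property. `SMILES_substructures` is only the name floated above, not an existing OPTIMADE property, and the base URL and entry are placeholders. The escaping step assumes the filter grammar's rule that backslashes and double quotes inside a string token are backslash-escaped, which matters because SMILES uses `\` for cis/trans bonds.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Placeholder endpoint, not a real provider.
BASE_URL = "https://example.org/optimade/v1/structures"


def escape_filter_string(value):
    """Escape backslashes and double quotes for use inside a filter string."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def build_query_url(substructure):
    """Build a URL for the proposed `SMILES_substructures HAS "..."` filter."""
    filter_expr = f'SMILES_substructures HAS "{escape_filter_string(substructure)}"'
    return BASE_URL + "?" + urlencode({"filter": filter_expr})


def has(values, wanted):
    """HAS on a list: true if at least one element equals the operand."""
    return wanted in values


# Made-up entry carrying the proposed list property.
entry = {"id": "entry-3", "SMILES_substructures": ["c1ccccc1", "C(=O)O"]}

url = build_query_url("c1ccccc1")
matches = has(entry["SMILES_substructures"], "c1ccccc1")
```

Note that `has` is an exact string comparison, so a kekulized query such as "C1=CC=CC=C1" would not match this entry unless the implementation first translates it into the form used in the list.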

rartino avatar Jan 03 '22 23:01 rartino

@rartino

Nevertheless, if the dream is to query on substructures, maybe this can be done in another way than as a quasi-string-operation on a single SMILES field? Could we have something like an optional SMILES_substructures which is a list of all identifiable substructures? It could then be queried like: SMILES_substructures HAS "c1ccccc1".

The number of all possible substructures times all possible representations is just too large for anything but the most trivial molecules. Narrowing this set down to an arbitrary subset increases the risk of false negatives, and this is something I very much would like to avoid.

merkys avatar Jan 04 '22 08:01 merkys

@merkys

The number of all possible substructures times all possible representations is just too large for anything but the most trivial molecules.

Indeed. My intent was not for the list to contain "all possible representations" but rather that we could find some standardization for substructures. I suppose you could argue that if there is no canonicalized form for the full SMILES, then there also is none for substructures. Nevertheless, maybe one could refer to some standard list/database of substructures and say something along the lines of ~ "substructures SHOULD only be listed if they are present in list X, and, if present, MUST use the precise SMILES in that list"?
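As a rough illustration of what such a rule could mean for a provider, the check below compares a listed set of substructures against a reference list. The reference list here is invented for the example (there is no agreed "list X"), and so are the names.

```python
# Invented stand-in for "list X": name -> the one agreed SMILES spelling.
REFERENCE_SUBSTRUCTURES = {
    "benzene ring": "c1ccccc1",
    "carboxylic acid": "C(=O)O",
    "nitrile": "C#N",
}

ALLOWED_SPELLINGS = frozenset(REFERENCE_SUBSTRUCTURES.values())


def unrecognized_substructures(listed):
    """Return the listed strings that do not appear verbatim in the list."""
    return [smiles for smiles in listed if smiles not in ALLOWED_SPELLINGS]


ok = unrecognized_substructures(["c1ccccc1", "C#N"])
bad = unrecognized_substructures(["c1ccccc1", "C1=CC=CC=C1"])
```

A purely string-based check like this cannot tell whether an unrecognized string is a substructure missing from the list (the SHOULD part) or a listed substructure in a different spelling (the MUST part); telling those apart would again need a toolkit that understands the chemistry.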

On the other hand, I suppose we could make the same kind of canonicalization for the full SMILES formula? Not standardizing the full formula, but saying that all substructures present in a list must be in the form given in the list?

Narrowing this set down to an arbitrary subset increases the risk of false negatives, and this is something I very much would like to avoid.

Well, given that the detection of substructures is subject to a possibly imperfect detection algorithm with a certain level of subjectivity in what is regarded as a bond, etc., I don't think it is technically possible to eliminate false negatives. (But perhaps you mean that if I have identified substructure Y, then there should be no false negative if you are also looking for that substructure.)

rartino avatar Jan 10 '22 10:01 rartino