
Aromatic SMILES

Open merkys opened this issue 3 years ago • 3 comments

As of v2.5.0, OPSIN outputs kekulized SMILES (benzene is translated to C1=CC=CC=C1). If the information about the ring aromaticity is known to OPSIN, output of aromatic SMILES (benzene translated to c1ccccc1) would be very beneficial, as algorithms to aromatize kekulized structures are not straightforward. It would be best to have both output forms available, controllable via a command line option.

merkys avatar Jan 13 '21 11:01 merkys

OPSIN does internally have a concept of an atom having a "maximum number of non-cumulative double bonds", which does roughly correspond to aromaticity in SMILES, but there are differences. In OPSIN's internal format it's not incorrect to represent pyrrole as n1cccc1, which is invalid in SMILES*. My understanding is that most toolkits do have a method for perceiving aromaticity so I'm not that clear on the use case. In your proposal would you expect benzene and cyclohexa-1,3,5-triene to have different SMILES?

* In this case OPSIN does actually use [nH]1cccc1 with the hydrogen on the N being interpreted as a hint that if unspecified pyrrole should be assumed to be 1H-pyrrole

dan2097 avatar Jan 15 '21 15:01 dan2097

My understanding is that most toolkits do have a method for perceiving aromaticity so I'm not that clear on the use case.

I am doing analysis of SMILES without toolkits. Aromaticity perception from scratch requires identification of rings, and this is already quite cumbersome and computationally intensive.
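To illustrate why ring identification is the awkward part, here is a minimal Python sketch of the two steps involved: finding a ring, then applying a crude Hückel 4n + 2 check. The data model (integer atom indices and a bond-order dictionary) and the function names `find_ring` and `looks_aromatic` are hypothetical and chosen for illustration only; they are not OPSIN's internals. The sketch assumes the SMILES has already been parsed into a graph, handles only a single monocycle with alternating single and double bonds, and ignores heteroatom lone pairs, fused systems, and charges.

```python
from typing import Dict, List, Tuple

# Hypothetical toy representation: a bond is keyed by an (i, j) atom-index
# pair with i < j and maps to its bond order. Not OPSIN's data model.
Bonds = Dict[Tuple[int, int], int]


def find_ring(n_atoms: int, bonds: Bonds) -> List[int]:
    """Return the atoms of one ring found by depth-first search, or []."""
    adjacency: Dict[int, List[int]] = {i: [] for i in range(n_atoms)}
    for a, b in bonds:
        adjacency[a].append(b)
        adjacency[b].append(a)

    def dfs(atom: int, parent: int, path: List[int]) -> List[int]:
        path.append(atom)
        for nbr in adjacency[atom]:
            if nbr == parent:
                continue
            if nbr in path:
                # Back edge: the ring is the path segment starting at nbr.
                return path[path.index(nbr):]
            ring = dfs(nbr, atom, path)
            if ring:
                return ring
        path.pop()
        return []

    for start in range(n_atoms):
        ring = dfs(start, -1, [])
        if ring:
            return ring
    return []


def looks_aromatic(ring: List[int], bonds: Bonds) -> bool:
    """Crude test: each ring atom has exactly one ring double bond and the
    pi-electron count (2 per double bond) satisfies 4n + 2."""
    ring_bonds = [
        tuple(sorted((ring[i], ring[(i + 1) % len(ring)])))
        for i in range(len(ring))
    ]
    doubles_per_atom = {atom: 0 for atom in ring}
    n_double = 0
    for bond in ring_bonds:
        if bonds[bond] == 2:
            n_double += 1
            for atom in bond:
                doubles_per_atom[atom] += 1
    if any(count != 1 for count in doubles_per_atom.values()):
        return False
    return (2 * n_double) % 4 == 2


# Kekulized benzene (C1=CC=CC=C1) and cyclobutadiene (C1=CC=C1) as graphs.
benzene: Bonds = {(0, 1): 2, (1, 2): 1, (2, 3): 2, (3, 4): 1, (4, 5): 2, (0, 5): 1}
cyclobutadiene: Bonds = {(0, 1): 2, (1, 2): 1, (2, 3): 2, (0, 3): 1}

print(looks_aromatic(find_ring(6, benzene), benzene))
print(looks_aromatic(find_ring(4, cyclobutadiene), cyclobutadiene))
```

Even this toy version needs a graph traversal before any electron counting can happen, and a realistic implementation would additionally need a SMILES parser, a proper ring-set algorithm for fused systems, and rules for heteroatoms such as the pyrrole nitrogen.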

By the way, the OpenSMILES specification seems to recommend the aromatic form:

The Kekule form is always acceptable for SMILES input. For output, the aromatic form (using lowercase letters) is preferred. The lowercase symbols eliminate the arbitrary choice of how to assign the single and double bonds, and provide a normalized form that more accurately reflects the electronic configuration.

It also notes that the aromatic form is preferable for matching via SMARTS.

In your proposal would you expect benzene and cyclohexa-1,3,5-triene to have different SMILES?

Good point. Most likely not, as cyclohexa-1,3,5-triene is aromatic, so I expect both to be c1ccccc1.

merkys avatar Jan 22 '21 15:01 merkys

@merkys You could just use RDKit to do aromaticity perception. It is usually quite fast; I have been using it. Or is there a reason not to post-process the output through a toolkit?
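For reference, a minimal sketch of that post-processing step, assuming RDKit is installed and the kekulized SMILES from OPSIN is already available as a string. The helper name `aromatize` is hypothetical; `Chem.MolFromSmiles` and `Chem.MolToSmiles` are the RDKit calls doing the work. RDKit applies its own aromaticity model when parsing, so the result reflects RDKit's perception rather than anything OPSIN knows internally.

```python
from rdkit import Chem


def aromatize(smiles: str) -> str:
    """Parse a (possibly kekulized) SMILES and write it back out.

    RDKit perceives aromaticity during sanitization on parsing, and
    MolToSmiles writes aromatic atoms in lowercase by default.
    """
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        # MolFromSmiles returns None when the input cannot be parsed.
        raise ValueError(f"could not parse SMILES: {smiles!r}")
    return Chem.MolToSmiles(mol)


# Kekulized benzene, as OPSIN emits it.
print(aromatize("C1=CC=CC=C1"))  # c1ccccc1
```

Note that `MolToSmiles` also canonicalizes the string, so atom order in the output may differ from the input.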

simonmb avatar Nov 02 '22 06:11 simonmb