
Bug in SMARTS string for CH-acidic and/or CH-acidic_strong

Open ghost opened this issue 2 years ago • 1 comments

Hello! I am trying to perform some substructure searches on molecules using openbabel 3.1.1 and I am having trouble with a particular SMARTS string:

CH-acidic_strong:

[CX4;!$([H0]);!$([!#6;!$([P,S]=O);!$(N(~O)~O)])]([$([CX3]=[O,N,S]),$(C#[N]),$([S,P]=[OX1]),$([NX3]=O),$([NX3+](=O)[O-]);!$(*[S,O,N;H1,H2]);!$([*+0][S,O;X1-])])[$([CX3]=[O,N,S]),$(C#[N]),$([S,P]=[OX1]),$([NX3]=O),$([NX3+](=O)[O-]);!$(*[S,O,N;H1,H2]);!$([*+0][S,O;X1-])]

The 'SMARTS_InteLigand.txt' file states:

CH-acidic: [$([CX4;!$([H0]);!$(C[!#6;!$([P,S]=O);!$(N(~O)~O)])][$([CX3]=[O,N,S]),$(C#[N]),$([S,P]=[OX1]),$([NX3]=O),$([NX3+](=O)[O-]);!$(*[S,O,N;H1,H2]);!$([*+0][S,O;X1-])]),$([CX4;!$([H0])]1[CX3]=[CX3][CX3]=[CX3]1)]
 *C-H alpha to carbony, nitro or similar, C is not double-bonded, only C, H, S,P=O and nitro substituents allowed. 
 *pentadiene is included. acids, their salts, prim./sec. amides, and imides are excluded. 
 *hits also CH-acidic_strong

CH-acidic_strong: [CX4;!$([H0]);!$([!#6;!$([P,S]=O);!$(N(~O)~O)])]([$([CX3]=[O,N,S]),$(C#[N]),$([S,P]=[OX1]),$([NX3]=O),$([NX3+](=O)[O-]);!$(*[S,O,N;H1,H2]);!$([*+0][S,O;X1-])])[$([CX3]=[O,N,S]),$(C#[N]),$([S,P]=[OX1]),$([NX3]=O),$([NX3+](=O)[O-]);!$(*[S,O,N;H1,H2]);!$([*+0][S,O;X1-])]
 *same as above (without pentadiene), but carbonyl or similar on two or three sides

I know these are very long SMARTS strings, but my problem lies in the first part of each. For 'CH-acidic' it is:

[CX4;!$([H0]);!$(C[!#6;!$([P,S]=O);!$(N(~O)~O)])]

But for 'CH-acidic_strong' it is:

[CX4;!$([H0]);!$([!#6;!$([P,S]=O);!$(N(~O)~O)])]
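A minimal sketch for locating the difference mechanically, using only Python's standard `difflib`. The names `head_acidic`, `head_strong` and `differences` are made up for illustration; the two strings are copied from the fragments above.

```python
from difflib import SequenceMatcher

# First atom expressions copied from the two patterns quoted above.
head_acidic = "[CX4;!$([H0]);!$(C[!#6;!$([P,S]=O);!$(N(~O)~O)])]"
head_strong = "[CX4;!$([H0]);!$([!#6;!$([P,S]=O);!$(N(~O)~O)])]"


def differences(a, b):
    """Return (operation, text_in_a, text_in_b, offset_in_a) for each non-matching span."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [
        (tag, a[i1:i2], b[j1:j2], i1)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


for tag, in_a, in_b, offset in differences(head_acidic, head_strong):
    print(f"{tag} at offset {offset}: {in_a!r} vs {in_b!r}")
```

For these two fragments the comparison reports a single character: the `C` that directly follows the second `!$(` in the 'CH-acidic' version and is absent from the 'CH-acidic_strong' version.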

According to 'SMARTS_InteLigand.txt' they should be the same, but they are not, and that is what confused me. As I understand it, the 'CH-acidic' SMARTS means: a C-atom with a total of 4 connections that must not have 0 hydrogens attached and that must be bound to another C-atom of some kind ([#6]) that is not bound to P=O, S=O or N(~O)~O.

On the other hand, the 'CH-acidic_strong' SMARTS means: A C-atom with a total of 4 connections, that must not have 0 hydrogens attached and that must be a C-atom of some kind that is not bound to P=O, S=O or N(~O)~O.

So in 'CH-acidic' the CX4 is bound to another C-atom ([#6]), whereas in 'CH-acidic_strong' the CX4 itself must not be bound to P=O and so on. At least, that is how I understand it. It is quite possible that I simply don't understand it correctly, and in that case I would be very happy if anyone could explain my mistake to me. The reason I mention it at all is that when I use the 'CH-acidic_strong' SMARTS as is, I find the substructure in the molecule described by the SMILES

'S(c1c(C(C)(C)C)cc(CO)c(C)c1)C1C(=O)O[C@@](C(C)C)(CCc2ccc(N)cc2)CC1=O' 
(see picture down below). 
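One way to reproduce such a search from Python is sketched below. It assumes the Open Babel 3.x Python bindings are installed (`from openbabel import pybel`); if they are not, the hypothetical helper `find_hits` simply returns `None`. The constant names are illustrative, and the pattern is assembled from its repeated parts only to keep the lines short.

```python
try:
    from openbabel import pybel  # Open Babel 3.x Python bindings
except ImportError:
    pybel = None  # bindings not installed; find_hits() then returns None

# The electron-withdrawing-group expression occurs twice in 'CH-acidic_strong'.
EWG = (
    "[$([CX3]=[O,N,S]),$(C#[N]),$([S,P]=[OX1]),$([NX3]=O),$([NX3+](=O)[O-]);"
    "!$(*[S,O,N;H1,H2]);!$([*+0][S,O;X1-])]"
)
HEAD_STRONG = "[CX4;!$([H0]);!$([!#6;!$([P,S]=O);!$(N(~O)~O)])]"
CH_ACIDIC_STRONG = HEAD_STRONG + "(" + EWG + ")" + EWG

SMILES = "S(c1c(C(C)(C)C)cc(CO)c(C)c1)C1C(=O)O[C@@](C(C)C)(CCc2ccc(N)cc2)CC1=O"


def find_hits(smarts, smiles):
    """Return the matching atom-index tuples, or None if Open Babel is unavailable."""
    if pybel is None:
        return None
    molecule = pybel.readstring("smi", smiles)
    return pybel.Smarts(smarts).findall(molecule)


print(find_hits(CH_ACIDIC_STRONG, SMILES))
```

Passing a different pattern as the first argument allows the same molecule to be tested against both variants.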

But when I exchange the first part of the SMARTS string with that of 'CH-acidic', I no longer get a hit in the molecule.
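The exchange described here can be sketched in plain Python, which avoids editing such a long string by hand. The helper `split_first_atom` and the variable names are hypothetical; the strings are taken from the patterns quoted earlier. Because brackets nest inside recursive `$(...)` expressions, the helper counts bracket depth instead of looking for the first `]`.

```python
def split_first_atom(smarts):
    """Split a SMARTS string into its leading bracket atom and the remainder."""
    if not smarts.startswith("["):
        raise ValueError("pattern does not start with a bracket atom")
    depth = 0
    for index, char in enumerate(smarts):
        if char == "[":
            depth += 1
        elif char == "]":
            depth -= 1
            if depth == 0:
                return smarts[: index + 1], smarts[index + 1 :]
    raise ValueError("unbalanced brackets")


# The electron-withdrawing-group expression occurs twice in 'CH-acidic_strong'.
ewg = (
    "[$([CX3]=[O,N,S]),$(C#[N]),$([S,P]=[OX1]),$([NX3]=O),$([NX3+](=O)[O-]);"
    "!$(*[S,O,N;H1,H2]);!$([*+0][S,O;X1-])]"
)
strong = "[CX4;!$([H0]);!$([!#6;!$([P,S]=O);!$(N(~O)~O)])]" + "(" + ewg + ")" + ewg
head_acidic = "[CX4;!$([H0]);!$(C[!#6;!$([P,S]=O);!$(N(~O)~O)])]"

# Keep everything after the first atom and put the 'CH-acidic' head in front.
head_strong, tail = split_first_atom(strong)
swapped = head_acidic + tail
print(swapped)
```

The resulting `swapped` string differs from the original 'CH-acidic_strong' pattern only in its first atom expression.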

And I just don't understand why. I also don't know which SMARTS is correct. I feel that if I don't understand these SMARTS strings correctly, I don't understand the SMARTS syntax in general.

I hope someone can help me with this problem, and please forgive me if I just made a very simple or obvious mistake. I am doing my best to teach myself this material, since I need it for my work, but I don't have anyone to ask about it.

Thanks a lot!

[Image: molecule_hit]

ghost, Jun 30 '23 09:06

Each time you report an issue about openbabel here on GitHub, the interface provides a template. It helps you organize your observations and helps others identify the possible cause, which typically leads to an improvement in how the program is used and/or in the program itself.

You can still edit your question using the nuts and bolts the template provides, in particular to organize the report into sections (each introduced by a headline following `##`):

## Environment Information
Open Babel version:
Operating system and version:

## Expected Behavior
<!-- Describe the intended output or what you expected to see. -->

## Actual Behavior
<!--- If describing a bug, tell us what happens instead of the expected behavior -->
<!--- If suggesting a change/improvement, explain the difference from current behavior -->

## Steps to Reproduce
<!--
If the problem occurs with a particular file, please either upload and attach the file or include a link here - this greatly improves our ability to test your problem.
Please include screenshots or text output if they help illustrate a behavior.
-->

In addition, the interface lets you distinguish running text from snippets of code or CLI output by marking the latter up as a code block: add a line of three backticks before the section and another line of three backticks after it. Do not confuse backticks with single quotes.

Because some characters have a special meaning in the (GitHub-flavored) Markdown syntax used here, it is also safer to enclose SMILES and SMARTS strings in backticks. This also makes it easier to copy and paste them from here into a local instance of openbabel. To illustrate the above, see e.g.

obabel -:"C#Cc1ccccc1" -h --gen3d -O phenylacetylene.sdf

for a snippet of source code enclosed in a fenced code block, and

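[CX4;!$([H0]);!$([!#6;!$([P,S]=O);!$(N(~O)~O)])]([$([CX3]=[O,N,S]),$(C#[N]),$([S,P]=[OX1]),$([NX3]=O),$([NX3+](=O)[O-]);!$(*[S,O,N;H1,H2]);!$([*+0][S,O;X1-])])[$([CX3]=[O,N,S]),$(C#[N]),$([S,P]=[OX1]),$([NX3]=O),$([NX3+](=O)[O-]);!$(*[S,O,N;H1,H2]);!$([*+0][S,O;X1-])]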

for a longer string. For a short keyword, single backticks are fine, e.g. `print`.
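The advice above can also be applied programmatically, for instance when a script prepares the text of a report. The following is only a sketch: the helper `as_markdown_code` and its 40-character threshold are arbitrary choices for illustration, and it assumes the text itself contains no backticks.

```python
FENCE = "`" * 3  # three backticks, built at runtime to keep this example readable


def as_markdown_code(text):
    """Wrap text so that Markdown displays it verbatim.

    Short single-line text becomes inline code; long or multi-line text
    becomes a fenced code block. Assumes the text contains no backticks.
    """
    if "\n" not in text and len(text) <= 40:
        return "`" + text + "`"
    return FENCE + "\n" + text + "\n" + FENCE


print(as_markdown_code("print"))
print(as_markdown_code('obabel -:"C#Cc1ccccc1" -h --gen3d -O phenylacetylene.sdf'))
```

The short keyword comes out as inline code, while the longer command line is placed in a fenced block.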

nbehrnd avatar Jun 30 '23 11:06 nbehrnd