Bug in SMARTS string for CH-acidic and/or CH-acidic_strong
Hello! I am trying to perform some substructure searches on molecules using openbabel 3.1.1 and I am having trouble with a particular SMARTS string:
CH-acidic_strong:
`[CX4;!$([H0]);!$([!#6;!$([P,S]=O);!$(N(~O)~O)])]([$([CX3]=[O,N,S]),$(C#[N]),$([S,P]=[OX1]),$([NX3]=O),$([NX3+](=O)[O-]);!$(*[S,O,N;H1,H2]);!$([*+0][S,O;X1-])])[$([CX3]=[O,N,S]),$(C#[N]),$([S,P]=[OX1]),$([NX3]=O),$([NX3+](=O)[O-]);!$(*[S,O,N;H1,H2]);!$([*+0][S,O;X1-])]`
The 'SMARTS_InteLigand.txt' file states:
CH-acidic: [$([CX4;!$([H0]);!$(C[!#6;!$([P,S]=O);!$(N(~O)~O)])][$([CX3]=[O,N,S]),$(C#[N]),$([S,P]=[OX1]),$([NX3]=O),$([NX3+](=O)[O-]);!$(*[S,O,N;H1,H2]);!$([*+0][S,O;X1-])]),$([CX4;!$([H0])]1[CX3]=[CX3][CX3]=[CX3]1)]
*C-H alpha to carbonyl, nitro or similar, C is not double-bonded, only C, H, S,P=O and nitro substituents allowed.
*pentadiene is included. acids, their salts, prim./sec. amides, and imides are excluded.
*hits also CH-acidic_strong
CH-acidic_strong: [CX4;!$([H0]);!$([!#6;!$([P,S]=O);!$(N(~O)~O)])]([$([CX3]=[O,N,S]),$(C#[N]),$([S,P]=[OX1]),$([NX3]=O),$([NX3+](=O)[O-]);!$(*[S,O,N;H1,H2]);!$([*+0][S,O;X1-])])[$([CX3]=[O,N,S]),$(C#[N]),$([S,P]=[OX1]),$([NX3]=O),$([NX3+](=O)[O-]);!$(*[S,O,N;H1,H2]);!$([*+0][S,O;X1-])]
*same as above (without pentadiene), but carbonyl or similar on two or three sides
I know these are very long SMARTS strings, but my problem lies in the first part of each. For 'CH-acidic' it is:
`[CX4;!$([H0]);!$(C[!#6;!$([P,S]=O);!$(N(~O)~O)])]`
But for 'CH-acidic_strong' it is:
`[CX4;!$([H0]);!$([!#6;!$([P,S]=O);!$(N(~O)~O)])]`
According to 'SMARTS_InteLigand.txt' they should be the same, but they are not, and that is what confused me. To my understanding, the 'CH-acidic' SMARTS means:
A C-atom with a total of 4 connections that must not have 0 hydrogens attached and that must be bound to another C-atom of some kind ([#6]) that is not bound to P=O, S=O or N(~O)~O.
The 'CH-acidic_strong' SMARTS, on the other hand, means: a C-atom with a total of 4 connections that must not have 0 hydrogens attached and that must itself be a C-atom of some kind that is not bound to P=O, S=O or N(~O)~O.
So in 'CH-acidic' the CX4 is bound to another C-atom ([#6]), whereas in 'CH-acidic_strong' the CX4 itself must not be bound to P=O...
At least, that is how I understood it. It is entirely possible that I simply don't understand it correctly, in which case I would be very happy if anyone could explain my mistake to me.
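To pin down exactly where the two first-atom fragments diverge, they can be compared character by character. The following is only a minimal sketch in plain Python (no Open Babel involved); the helper name `differing_middle` is made up for this illustration, and the two strings are the fragments quoted above.

```python
from typing import Tuple


def differing_middle(a: str, b: str) -> Tuple[str, str]:
    """Strip the shared prefix and shared suffix; return what is left of each string."""
    # Length of the common prefix.
    p = 0
    while p < min(len(a), len(b)) and a[p] == b[p]:
        p += 1
    # Length of the common suffix, not allowed to overlap the prefix.
    s = 0
    while s < min(len(a), len(b)) - p and a[-1 - s] == b[-1 - s]:
        s += 1
    return a[p:len(a) - s], b[p:len(b) - s]


# First-atom fragments as quoted in this report.
ch_acidic = "[CX4;!$([H0]);!$(C[!#6;!$([P,S]=O);!$(N(~O)~O)])]"
ch_acidic_strong = "[CX4;!$([H0]);!$([!#6;!$([P,S]=O);!$(N(~O)~O)])]"

only_in_acidic, only_in_strong = differing_middle(ch_acidic, ch_acidic_strong)
print(repr(only_in_acidic), repr(only_in_strong))  # -> 'C' ''
```

The only difference is the single `C` directly after `!$(`. In SMARTS, the first atom written inside a recursive `$(...)` environment stands for the atom being tested, so this `C` decides whether the inner `[!#6;...]` test applies to a neighbour of the CX4 atom or to the CX4 atom itself.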
I mention this because when I use the 'CH-acidic_strong' SMARTS as is, I can find the substructure in the molecule described by the SMILES
`S(c1c(C(C)(C)C)cc(CO)c(C)c1)C1C(=O)O[C@@](C(C)C)(CCc2ccc(N)cc2)CC1=O`
(see picture down below).
But when I replace the first part of the SMARTS string with the one from 'CH-acidic', I no longer get a hit in the molecule.
I don't understand why, and I also don't know which SMARTS is correct. I feel that if I don't understand these SMARTS strings correctly, I don't understand SMARTS syntax in general.
I hope someone can help me with this problem, and please forgive me if I have just made a very simple or obvious mistake. I am doing my very best to teach myself this material, since I need it for my work, but I don't have anyone to ask about it.
Thanks a lot!
Each time you report an issue about openbabel here on GitHub, the interface provides a template. It helps you organize your observations and helps others identify the possible cause, which typically leads to an improvement in how the program is used and/or an improvement of the program itself.
You can still edit your question using the nuts and bolts the template provides, in particular to organize the report in sections (each introduced by a heading line starting with `##`):
## Environment Information
Open Babel version:
Operating system and version:
## Expected Behavior
<!-- Describe the intended output or what you expected to see. -->
## Actual Behavior
<!--- If describing a bug, tell us what happens instead of the expected behavior -->
<!--- If suggesting a change/improvement, explain the difference from current behavior -->
## Steps to Reproduce
<!--
If the problem occurs with a particular file, please either upload and attach the file or include a link here - this greatly improves our ability to test your problem.
Please include screenshots or text output if they help illustrate a behavior.
-->
In addition, the interface lets you distinguish running text from snippets of code or output returned by the CLI by marking the latter up as a code block: add a line of three backticks before the section and a line of three backticks after it. Do not confuse backticks with single quotes.
Because some characters have a special meaning in the (GitHub-flavored) Markdown syntax used here, it is also safer to enclose SMILES and SMARTS in backticks. This also makes it easier to copy and paste them from here into a local instance of openbabel. To illustrate the above, see e.g.
obabel -:"C#Cc1ccccc1" -h --gen3d -O phenylacetylene.sdf
for a command enclosed in a fenced code block, and
`[CX4;!$([H0]);!$([!#6;!$([P,S]=O);!$(N(~O)~O)])]([$([CX3]=[O,N,S]),$(C#[N]),$([S,P]=[OX1]),$([NX3]=O),$([NX3+](=O)[O-]);!$(*[S,O,N;H1,H2]);!$([*+0][S,O;X1-])])[$([CX3]=[O,N,S]),$(C#[N]),$([S,P]=[OX1]),$([NX3]=O),$([NX3+](=O)[O-]);!$(*[S,O,N;H1,H2]);!$([*+0][S,O;X1-])]`
for a longer string. For a short keyword, single backticks are fine, e.g. `print`.
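To make the risk concrete: a piece such as `[NX3+](=O)` has the shape of a Markdown inline link, and a pair of `*` characters can be read as emphasis markers. The sketch below is only a crude approximation of what a Markdown renderer does, written with two regular expressions. It is not a real Markdown parser, and both function names are invented for this illustration. It does suggest how an unprotected SMARTS fragment may silently lose characters.

```python
import re


def naive_markdown_render(text: str) -> str:
    """Very rough stand-in for Markdown rendering (an assumption, not a real parser)."""
    # Inline link "[label](destination)" is shown as just "label".
    text = re.sub(r"\[([^\[\]]*)\]\(([^()\s]*)\)", r"\1", text)
    # "*emphasised*" is shown without the two asterisks.
    text = re.sub(r"\*([^*]+)\*", r"\1", text)
    return text


def as_code_span(text: str) -> str:
    """Wrap text in single backticks; assumes the text itself contains no backtick."""
    return "`" + text + "`"


# Tail of the substituent pattern from the SMARTS discussed in this thread.
fragment = "$([NX3+](=O)[O-]);!$(*[S,O,N;H1,H2]);!$([*+0][S,O;X1-])"

print(naive_markdown_render(fragment))
# -> $(NX3+[O-]);!$([S,O,N;H1,H2]);!$([+0][S,O;X1-])
print(as_code_span(fragment))
```

The rendered form has lost `(=O)` and both `*` wildcards, so it is no longer the same pattern. Inside backticks the string is displayed literally.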