
Correct atom configuration prior to fingerprint calculation

Open bachi55 opened this issue 6 years ago • 5 comments

Hello,

I was wondering about the correct usage of do.typing(), do.aromaticity() and do.isotopes() before calculating the fingerprint of a molecule parsed from its SMILES using parse.smiles().

Let me go through a few examples.

MACCS: The documentation of get.fingerprint contains:

smiles <- c('CCC', 'CCN', 'CCN(C)(C)', 'c1ccccc1Cc1ccccc1','C1CCC1CC(CN(C)(C))CC(=O)CC')
mols <- parse.smiles(smiles)
fps <- lapply(mols, get.fingerprint, type='maccs')

There is no mention that aromaticity might need to be perceived first, as otherwise some SMARTS patterns may not be matched properly (?). At least in the CDK tests for the MACCSFingerprinter, atom typing and aromaticity detection are performed first.
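A minimal sketch of what such preprocessing could look like on the rcdk side, assuming that do.typing() and do.aromaticity() modify the underlying Java molecule in place (the shortened SMILES list is only for illustration). Whether these calls are actually required for MACCS is exactly the open question here:

```r
library(rcdk)

smiles <- c('CCC', 'c1ccccc1Cc1ccccc1')
mols <- parse.smiles(smiles)

# Assumption: rcdk molecules are Java object references, so these
# calls are expected to change each molecule in place.
for (mol in mols) {
  do.typing(mol)       # perceive atom types
  do.aromaticity(mol)  # perceive aromaticity
}

# Compute the MACCS keys on the preprocessed molecules
fps <- lapply(mols, get.fingerprint, type = 'maccs')
```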

PubChem: For this fingerprinter, the CDK tests indicate that implicit hydrogens should be converted to explicit ones, i.e. with convert.implicit.to.explicit.
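As a sketch, the hydrogen conversion could be applied before the PubChem fingerprint like this. Calling do.typing() beforehand is a cautious assumption, not something the CDK tests are confirmed to require:

```r
library(rcdk)

mol <- parse.smiles('CCN')[[1]]

# Assumption: type the atoms first so hydrogens can be added consistently
do.typing(mol)

# Turn implicit hydrogens into explicit atoms (changes the molecule in place)
n_before <- get.atom.count(mol)
convert.implicit.to.explicit(mol)
n_after <- get.atom.count(mol)

fp <- get.fingerprint(mol, type = 'pubchem')
```

Comparing n_before and n_after is a quick way to check that the conversion actually added atoms.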

Klekota and Roth: Here the CDK tests indicate that no additional step, e.g. typing or aromaticity detection, needs to be applied.

I could continue this list with other fingerprinters, which seem to expect different "modifications" of the parsed molecular structure. Some fingerprinters seem to take care of this internally (within the class).

I would like to know how others are dealing with this issue. Which transformations do you perform when you calculate different fingerprints (I could imagine that the problem extends to descriptors as well)? Am I overthinking the problem here? Can "too many modifications" (e.g. calling do.typing when it is not needed) harm my calculation, i.e. produce wrong fingerprints?
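One way to probe this empirically is to compare the set bits with and without preprocessing. The helper same_bits() below is hypothetical (not part of rcdk), and the choice of SMILES and fingerprint types is only illustrative:

```r
library(rcdk)

# Hypothetical helper: TRUE if the fingerprint has the same set bits
# with and without typing / aromaticity / isotope configuration.
same_bits <- function(smi, type) {
  raw  <- parse.smiles(smi)[[1]]
  prep <- parse.smiles(smi)[[1]]
  do.typing(prep)
  do.aromaticity(prep)
  do.isotopes(prep)
  fp_raw  <- get.fingerprint(raw,  type = type)
  fp_prep <- get.fingerprint(prep, type = type)
  identical(sort(fp_raw@bits), sort(fp_prep@bits))
}

smiles <- c('CCC', 'CCN', 'c1ccccc1Cc1ccccc1')
types  <- c('maccs', 'pubchem', 'kr')

# Logical matrix: one row per SMILES, one column per fingerprint type
result <- sapply(types, function(tp) sapply(smiles, same_bits, type = tp))
```

A FALSE entry would point to a fingerprinter whose output depends on the preprocessing. It would not by itself say which of the two variants is the correct one.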

I believe the problem should be solved in CDK itself, but can we perhaps make the documentation on the rcdk side more precise?

Best regards,

Eric

bachi55 • Jan 17 '19 11:01