
isomericSmiles = False

Open LivC193 opened this issue 3 years ago • 3 comments

From what I understand, you set isomericSmiles = False in your preprocessing (the filter_and_canonicalize function).

This means you don't take into account any isomeric information. Do you think this might be an issue, especially since isomers don't necessarily have similar chemical or physical properties?

LivC193, Feb 17 '21 17:02

Hey @LivC193, thanks for your interest in our code! Yes, isomericSmiles is set to False, although the name can be somewhat misleading since it only affects stereoisomers (see the RDKit documentation). So, for example, this wouldn't preclude differentiating between pentane and isopentane, since these have different connectivities.
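
To make the distinction concrete, here is a minimal sketch, assuming RDKit is installed. The helper `to_smiles` and the alanine example are illustrative choices and are not taken from the GuacaMol code. It checks that constitutional isomers stay distinct with isomericSmiles=False, while two enantiomers collapse to the same string.

```python
from rdkit import Chem


def to_smiles(smiles: str, isomeric: bool) -> str:
    """Parse a SMILES string and write it back as canonical SMILES.

    Hypothetical helper for illustration; not part of GuacaMol.
    """
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")
    return Chem.MolToSmiles(mol, isomericSmiles=isomeric)


# Constitutional isomers: different connectivity, so they remain distinct
# even when stereochemistry is dropped.
pentane = to_smiles("CCCCC", isomeric=False)
isopentane = to_smiles("CCC(C)C", isomeric=False)

# The two enantiomers of alanine: identical connectivity, opposite
# configuration at the single stereocentre.
alanine_a = "C[C@H](N)C(=O)O"
alanine_b = "C[C@@H](N)C(=O)O"

print("pentane vs isopentane distinct:", pentane != isopentane)
print(
    "enantiomers distinct, isomericSmiles=True: ",
    to_smiles(alanine_a, True) != to_smiles(alanine_b, True),
)
print(
    "enantiomers distinct, isomericSmiles=False:",
    to_smiles(alanine_a, False) != to_smiles(alanine_b, False),
)
```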

However, your point is valid in the more specific case of stereochemistry. This is an area for future work. I can only speculate on the authors' reason for excluding stereochemistry:

(a) for simplification, since @ and @@ encode stereochemistry only relatively (a different canonicalization can flip the sign) and therefore do not encode the absolute (R/S) stereochemistry of the molecule. It is therefore not good enough to merely encode single and double @ individually.

(b) because many scoring functions (such as those based on Morgan fingerprints, which are used for the rediscovery and similarity benchmarks) also ignore stereochemistry by default
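
Both points can be checked with a short sketch, assuming a reasonably recent RDKit. The fingerprint settings (radius 2, 2048 bits) and the helper `tanimoto` are assumptions made for illustration and are not necessarily what the GuacaMol benchmarks use. Part (a) writes the same alanine enantiomer with two atom orderings, so the @ / @@ marks differ while the canonical SMILES and CIP label agree. Part (b) compares the galactose and glucose SMILES from this thread with and without chirality in the Morgan fingerprint.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import rdFingerprintGenerator

# (a) One enantiomer of alanine, written with two different atom orderings.
# The chirality marks differ (@@ vs @) because they are relative to the
# order in which the neighbours appear in the string.
order_1 = Chem.MolFromSmiles("N[C@@H](C)C(=O)O")
order_2 = Chem.MolFromSmiles("C[C@H](N)C(=O)O")

same_canonical = (
    Chem.MolToSmiles(order_1, isomericSmiles=True)
    == Chem.MolToSmiles(order_2, isomericSmiles=True)
)
labels_1 = [label for _, label in Chem.FindMolChiralCenters(order_1)]
labels_2 = [label for _, label in Chem.FindMolChiralCenters(order_2)]
print("same canonical isomeric SMILES:", same_canonical)
print("CIP labels:", labels_1, labels_2)

# (b) Morgan fingerprints with and without chirality.
galactose = "C([C@@H]1[C@@H]([C@@H]([C@H]([C@@H](O1)O)O)O)O)O"
glucose = "C([C@@H]1[C@H]([C@@H]([C@H]([C@H](O1)O)O)O)O)O"


def tanimoto(smiles_a: str, smiles_b: str, include_chirality: bool) -> float:
    """Tanimoto similarity of Morgan bit vectors (hypothetical helper)."""
    generator = rdFingerprintGenerator.GetMorganGenerator(
        radius=2, fpSize=2048, includeChirality=include_chirality
    )
    fp_a = generator.GetFingerprint(Chem.MolFromSmiles(smiles_a))
    fp_b = generator.GetFingerprint(Chem.MolFromSmiles(smiles_b))
    return DataStructs.TanimotoSimilarity(fp_a, fp_b)


sim_default = tanimoto(galactose, glucose, include_chirality=False)
sim_chiral = tanimoto(galactose, glucose, include_chirality=True)
print("similarity, chirality ignored: ", sim_default)
print("similarity, chirality included:", sim_chiral)
```

With chirality ignored, the two sugars are indistinguishable to the fingerprint, so a similarity-based score could not reward getting the stereochemistry right even if the training SMILES retained it.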

JoshuaMeyers, Feb 22 '21 11:02

Hi @JoshuaMeyers, I completely agree with all your points. I just wanted to check whether isomers (other than structural ones) are taken into account. If you have carbohydrate entries in your dataset, not taking isomers into account might be problematic.

from rdkit import Chem

# Two hexopyranose stereoisomers that share the same connectivity
galactose_smiles = 'C([C@@H]1[C@@H]([C@@H]([C@H]([C@@H](O1)O)O)O)O)O'
glucose_smiles = 'C([C@@H]1[C@H]([C@@H]([C@H]([C@H](O1)O)O)O)O)O'
gal = Chem.MolFromSmiles(galactose_smiles)
glc = Chem.MolFromSmiles(glucose_smiles)

# Without stereochemistry, both collapse to the same SMILES
print(Chem.MolToSmiles(gal, isomericSmiles=False))
# OCC1OC(O)C(O)C(O)C1O
print(Chem.MolToSmiles(glc, isomericSmiles=False))
# OCC1OC(O)C(O)C(O)C1O

LivC193, Feb 22 '21 11:46

Yes, this is true. In the GuacaMol v1 SMILES training dataset, there are in fact 45 occurrences of the SMILES you give as an example (as substrings of larger molecules). Thanks for highlighting this case.
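
For anyone who wants to run a similar check, here is a rough sketch, assuming RDKit is available. The three-entry `dataset` list is a made-up stand-in and not the GuacaMol training file. It counts plain substring hits, as above, and also substructure hits, which additionally catch molecules where the same ring system is written in a different atom order.

```python
from rdkit import Chem

# Stereo-free SMILES from the example above
query_smiles = "OCC1OC(O)C(O)C(O)C1O"

# Hypothetical stand-in for a list of training SMILES
dataset = [
    "OCC1OC(O)C(O)C(O)C1O",   # the hexose itself
    "COC1OC(CO)C(O)C(O)C1O",  # a methyl glycoside of the same ring system
    "CCO",                    # ethanol, unrelated
]

# Plain text search: only finds the query when it appears verbatim.
substring_hits = sum(query_smiles in smiles for smiles in dataset)

# Substructure search: independent of how the SMILES string is written.
query = Chem.MolFromSmiles(query_smiles)
substructure_hits = sum(
    Chem.MolFromSmiles(smiles).HasSubstructMatch(query) for smiles in dataset
)

print("substring hits:   ", substring_hits)
print("substructure hits:", substructure_hits)
```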

JoshuaMeyers, Feb 22 '21 12:02