guacamol
guacamol copied to clipboard
isomeriSmiles= False
From what I understand you set isomericSmiles = False in your preprocessing (filter_and_canonicalize
function).
This means you don't take into account any isomeric information. Do you think this might be an issue, especially since isomers don't necessarily have similar chemical or physical properties?
Hey @LivC182, thanks for your interest in our code! Yes isomericSmiles is set to False
although this name can be somewhat misleading since it only affects stereoisomers (see rdkit doc). So for example this wouldn't preclude the differentiation between pentane and isopentane since these have different connectivities.
However, your point is valid in the more specific case of stereochemistry. This is an area for future work. I can only speculate on the authors' reason for excluding stereochemistry:
(a) for simplification since @@
and @
encode stereochemistry relatively (different canonicalization can flip this sign), and therefore does not encode the absolute (R/S) stereochemistry of the molecule. Therefore its not good enough to merely encode single and double @
individually)
(b) because many scoring functions (such as Morgan fingerprints which are used for the rediscovery and similarity benchmarks) also ignore stereochemistry by default
HI @JoshuaMeyers, I completely agree with all your points. I just wanted to make sure if isomers (except structural ones) are taken into account. If you have carbohydrate entries in your dataset not taking isomers into account might be problematic.
galactose_smiles = ''C([C@@H]1[C@@H]([C@@H]([C@H]([C@@H](O1)O)O)O)O)O''
glucose_smiles="C([C@@H]1[C@H]([C@@H]([C@H]([C@H](O1)O)O)O)O)O'"
gal = Chem.MolFromSmiles(galactose_smiles)
glc = Chem.MolFromSmiles(glucose_smiles)
print(Chem.MolToSmiles(gal, isomericSmiles=False))
'OCC1OC(O)C(O)C(O)C1O'
print(Chem.MolToSmiles(glc, isomericSmiles=False))
'OCC1OC(O)C(O)C(O)C1O'
Yes this is true. In the GuacaMol v1 SMILES training dataset, there are in fact 45 occurrences of the SMILES you give as an example (as substrings of larger molecules). Thanks for highlighting this case.