smina-docking-benchmark
Invalid SMILES string in dataset
Hi, thanks for the great code!
I am trying to build my project on your work, using deep learning models to optimize docking scores.
However, I am having trouble using your dataset, since some of your SMILES strings appear to be invalid according to RDKit.
For example, in your file 5ht1b.csv, the SMILES string "[O-]NH+=C=CC=C(C1=2)N=CC2C3=CCNCC3" cannot be converted into a molecule with RDKit's Chem.MolFromSmiles function. RDKit reports that the explicit valence for nitrogen is greater than permitted.
Could you provide any guidance on resolving this issue? Many thanks in advance!
Hello!
Thanks for your interest and sorry for the late answer.
The SMILES were converted to .mol2 format for SMINA docking by OpenBabel. It is possible that some strings OpenBabel accepted as valid are rejected by RDKit. Currently, the only workaround is to filter the dataset manually. For example, the CVAE training code filters out molecules that RDKit considers invalid.
In the coming days I'll prepare a commit that makes it possible to filter out SMILES that are invalid according to RDKit. I'll let you know when this is done.
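Until then, manual filtering could look roughly like the sketch below. It is a minimal illustration, not the repository's actual filtering code: the names rdkit_is_valid and split_by_validity are hypothetical, and it assumes the SMILES strings have already been read from the CSV into a list. It relies only on the fact that Chem.MolFromSmiles returns None when RDKit cannot parse or sanitize a string.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def rdkit_is_valid(smiles: str) -> bool:
    """Return True if RDKit can parse and sanitize the SMILES string."""
    # Imported lazily so split_by_validity can be used with another checker
    # even when RDKit is not installed.
    from rdkit import Chem

    # MolFromSmiles returns None when parsing or sanitization fails,
    # for example on an explicit-valence violation.
    return Chem.MolFromSmiles(smiles) is not None


def split_by_validity(
    smiles_list: Iterable[str],
    is_valid: Optional[Callable[[str], bool]] = None,
) -> Tuple[List[str], List[str]]:
    """Partition SMILES strings into (valid, invalid) lists, preserving order."""
    # Hypothetical helper: defaults to the RDKit check defined above.
    check = is_valid if is_valid is not None else rdkit_is_valid
    valid: List[str] = []
    invalid: List[str] = []
    for smi in smiles_list:
        (valid if check(smi) else invalid).append(smi)
    return valid, invalid
```

Calling split_by_validity(smiles_list) with no second argument uses the RDKit check. Keeping the rejected strings in a separate list, rather than dropping them silently, makes it easy to see how many entries of a file such as 5ht1b.csv are affected.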
I see! Thanks very much. I look forward to your future work!