smina-docking-benchmark Invalid SMILES string in dataset

Open sungsoo-ahn opened this issue 3 years ago • 2 comments

Hi, thanks for the great code!

I am trying to build my project on your work, using deep learning models to optimize docking scores.

However, I am having trouble using your dataset, since some of your SMILES strings appear to be invalid according to RDKit.

For example, in your file 5ht1b.csv, the SMILES string "[O-]NH+=C=CC=C(C1=2)N=CC2C3=CCNCC3" cannot be converted into a molecule with RDKit's Chem.MolFromSmiles function. RDKit reports that the explicit valence for nitrogen is greater than permitted.

Could you provide any guidance on resolving this issue? Many thanks in advance!

Aug 30 '21 14:08 sungsoo-ahn

Hello!

Thanks for your interest and sorry for the late answer.

The SMILES were converted to .mol2 format for SMINA docking by OpenBabel. It is possible that some strings OpenBabel accepts as valid are rejected by RDKit. Currently, the only workaround is to filter the dataset manually. For example, the CVAE training code filters out molecules that RDKit considers invalid.
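
A minimal sketch of that manual filtering, for anyone who needs it before a fix lands. This is not the repository's own code: the function names, the `smiles` column name, and the file paths are assumptions made for illustration, so adjust them to the actual CSV layout. The only RDKit call used is `Chem.MolFromSmiles`, which returns `None` when a string cannot be parsed or sanitized.

```python
import csv
from typing import Callable, Dict, Iterable, List


def rdkit_is_valid(smiles: str) -> bool:
    """Return True if RDKit can parse and sanitize the SMILES string."""
    # Imported lazily so the helpers below can be used with another validator
    # when RDKit is not installed.
    from rdkit import Chem

    # MolFromSmiles returns None on parse or sanitization failure
    # (for example, an explicit valence that is greater than permitted).
    return Chem.MolFromSmiles(smiles) is not None


def filter_valid_rows(
    rows: Iterable[Dict[str, str]],
    smiles_column: str = "smiles",  # assumed column name, adjust as needed
    is_valid: Callable[[str], bool] = rdkit_is_valid,
) -> List[Dict[str, str]]:
    """Keep only the rows whose SMILES string passes the validator."""
    return [row for row in rows if is_valid(row[smiles_column])]


def filter_csv(
    in_path: str,
    out_path: str,
    smiles_column: str = "smiles",
    is_valid: Callable[[str], bool] = rdkit_is_valid,
) -> int:
    """Write a copy of the CSV without invalid rows; return the rows kept."""
    with open(in_path, newline="") as src:
        reader = csv.DictReader(src)
        fieldnames = reader.fieldnames or []
        kept = filter_valid_rows(reader, smiles_column, is_valid)
    with open(out_path, "w", newline="") as dst:
        writer = csv.DictWriter(dst, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(kept)
    return len(kept)
```

With RDKit installed, a call such as `filter_csv("5ht1b.csv", "5ht1b_valid.csv")` would write a filtered copy, assuming the SMILES column is really named `smiles`; the output file name here is hypothetical.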

In the coming days I'll prepare a commit that enables filtering out SMILES strings that RDKit considers invalid. I'll let you know when this is done.

Sep 10 '21 19:09 cieplinski-tobiasz

I see! Thanks very much. I look forward to your future work!

Sep 13 '21 08:09 sungsoo-ahn
