moleculenet
Non-canonical SMILES confound string-based classifiers
Running a string kernel classifier on the ClinTox dataset, I can obtain an AUROC of 0.96. When I canonicalize the SMILES, my AUROC drops to 0.69. This implies that there is a bias in the SMILES formatting between positive and negative examples that string-based classifiers can exploit to obtain unrealistically high performance, thereby tainting downstream benchmarks.
A solution would be to update the dataset to include only canonicalized SMILES.
Oh wow, that's quite the find! Yes, this will definitely need to be fixed as we overhaul MoleculeNet for the next v2 release. I'll mark this as a bug.
@cyrusmaher do you have any insight as to what the bias in the SMILES is?
Thinking about this some more: I believe we canonicalize SMILES before computing descriptors in DeepChem, which should handle this (but I'm not sure).
@cyrusmaher Would it be possible to provide a brief reproducing code snippet? That would help us figure out what's happening :)
One consideration: maybe a character frequency count between the raw and canonicalized forms? Maybe there are extra parentheses or aromatic operators added? (:)
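Something like this quick diff could surface it (hypothetical snippet, run on a toy molecule rather than ClinTox itself):

from collections import Counter
from rdkit import Chem

# Which characters does canonicalization add or remove?
smi = "C1=CC=CC=C1O"  # phenol, written in Kekule form
can = Chem.MolToSmiles(Chem.MolFromSmiles(smi), canonical=True)
print("canonical:", can)
print("added:  ", dict(Counter(can) - Counter(smi)))
print("removed:", dict(Counter(smi) - Counter(can)))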
@rbharath The easiest way to reproduce this is to run a string model on ClinTox with and without SMILES canonicalization. Here is an example of canonicalization:
from rdkit import Chem

# Round-trip through RDKit parsing to get the canonical form
canonical_smi = Chem.MolToSmiles(Chem.MolFromSmiles(smi), canonical=True)
Without bloating this thread with the helper code for the string kernel, etc., here's roughly what I ran:
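(Simplified sketch; a character n-gram logistic regression stands in for the actual string kernel, and the hyperparameters are arbitrary.)

import deepchem as dc
from rdkit import Chem
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

tasks, (train, valid, test), _ = dc.molnet.load_clintox(featurizer="Raw")

def canon(smiles):
    out = []
    for s in smiles:
        mol = Chem.MolFromSmiles(s)
        # Fall back to the raw string if RDKit fails to parse
        out.append(Chem.MolToSmiles(mol, canonical=True) if mol else s)
    return out

def auroc(train_smi, test_smi):
    vec = TfidfVectorizer(analyzer="char", ngram_range=(1, 4))
    clf = LogisticRegression(max_iter=1000)
    clf.fit(vec.fit_transform(train_smi), train.y[:, 1])  # CT_TOX task
    probs = clf.predict_proba(vec.transform(test_smi))[:, 1]
    return roc_auc_score(test.y[:, 1], probs)

print("raw SMILES:      ", auroc(list(train.ids), list(test.ids)))
print("canonical SMILES:", auroc(canon(train.ids), canon(test.ids)))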
@gabegrand I'm not sure precisely, but delocalization, tautomers, salts, etc. can all be handled differently in systematic ways.
@rbharath It's worth considering that canonicalization would not entirely eliminate this bias, e.g. if one source database is more likely to include charged species (perhaps prepared at a different pH or by a different procedure). You can see evidence for this in the different frequencies of "+" and "." characters between positive and negative examples in ClinTox.
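A check along these lines shows the gap (the actual significance test isn't reproduced here; the task index is illustrative):

import numpy as np
import deepchem as dc

tasks, (train, valid, test), _ = dc.molnet.load_clintox(featurizer="Raw")
smiles, labels = train.ids, train.y[:, 1]  # CT_TOX task

# Fraction of molecules containing each character, split by class
for ch in ["+", "."]:
    pos = np.mean([ch in s for s, y in zip(smiles, labels) if y == 1])
    neg = np.mean([ch in s for s, y in zip(smiles, labels) if y == 0])
    print(f"'{ch}': positives {pos:.3f} vs negatives {neg:.3f}")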
Edit: it appears that much of this significance is driven by SMILES that turn out to be duplicates once they're canonicalized.
Firstly, thanks to the DeepChem & MoleculeNet contributors; it is a great library and a great benchmark!
However, I think this issue really needs to be fixed before people publish papers (if they haven't already), and the ClinTox dataset should potentially be dropped altogether.
I was reproducing the TextCNN result on the ClinTox dataset, and I was very pleased to reproduce the benchmark AUC of ~0.995!
However, when I examined the underlying dataset, I found severe biases that should be fixed in the overall ClinTox benchmark. The benchmark shows TextCNN winning by around 11%, which is very unlikely to be true.
I observed the following in my experiments:
- Training TextCNN on the SMILES as given by the DeepChem dataloader: Train: 0.991, Val: 0.995, Test: 0.994
- Training on only the first and last 2 characters of the SMILES: Train: 0.850, Val: 0.700, Test: 0.956
- Training on RDKit canonical SMILES: Train: 0.916, Val: 0.837, Test: 0.905
I think the most surprising result was using only the first and last 2 characters of the SMILES. You can still achieve a very high (apparently +7% over SOTA) test AUC of 0.956; would you really trust such a classifier to detect toxic molecules?!
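For reference, a simplified sketch of that baseline (a logistic regression stands in for the TextCNN; the featurization details are illustrative):

import deepchem as dc
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

tasks, (train, valid, test), _ = dc.molnet.load_clintox(featurizer="Raw")

def ends(smiles):
    # Keep only the first two and last two characters of each SMILES
    return [s[:2] + " " + s[-2:] for s in smiles]

vec = CountVectorizer(analyzer="char_wb", ngram_range=(1, 2))
clf = LogisticRegression(max_iter=1000)
clf.fit(vec.fit_transform(ends(train.ids)), train.y[:, 1])  # CT_TOX task
probs = clf.predict_proba(vec.transform(ends(test.ids)))[:, 1]
print("test AUROC:", roc_auc_score(test.y[:, 1], probs))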
The third experiment, on canonical SMILES, agrees with the findings of @cyrusmaher.
The model dc.models.TextCNNModel seems to use the SMILES in dataset.ids, which are not canonical. My suggestion would be to canonicalise them in the dataset loader by default, for all datasets.
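In the meantime, a possible workaround is to canonicalise dataset.ids after loading (hypothetical helper, not part of DeepChem):

import deepchem as dc
from rdkit import Chem

def with_canonical_ids(dataset):
    ids = []
    for s in dataset.ids:
        mol = Chem.MolFromSmiles(s)
        # Fall back to the raw string if RDKit fails to parse
        ids.append(Chem.MolToSmiles(mol, canonical=True) if mol else s)
    return dc.data.NumpyDataset(dataset.X, dataset.y, dataset.w, ids=ids)

tasks, (train, valid, test), _ = dc.molnet.load_clintox()
train = with_canonical_ids(train)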
Thanks for the detailed analysis! We're working towards a MoleculeNet 2.0 paper, and we will update the recommendations and benchmark analysis for ClinTox as part of that release.
Thanks for your reply, that's good to know 😄
Any updates on this? Thanks!