String kernels can exploit biases in SMILES string format, skewing performance metrics

Open cyrusmaher opened this issue 4 years ago • 4 comments

Just a heads up on this issue: https://github.com/deepchem/moleculenet/issues/15

I propose that string-based classifiers canonicalize SMILES prior to processing, to prevent confounded estimates of performance, CIs, etc.

cyrusmaher avatar Dec 15 '20 00:12 cyrusmaher

Thanks for raising this! I made a change to the photoswitch dataset a couple of weeks ago to canonicalise all SMILES as a preprocessing step, and will make sure this is implemented for the other datasets!

Ryan-Rhys avatar Dec 18 '20 20:12 Ryan-Rhys

[image]

It's possible canonicalization doesn't fully eliminate the bias (e.g. if one set calculates SMILES at a different pH or is more likely to include salt forms). You can see that in the enrichment for "." and "+" characters between positive and negative examples in ClinTox.
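A rough way to check for this kind of character-level enrichment is sketched below. It is only an illustration: the function names (`char_presence_rate`, `presence_gap`) are hypothetical, and the SMILES strings are made-up toy examples, not ClinTox entries. It compares the fraction of strings in each class that contain a given character.

```python
def char_presence_rate(smiles_list, chars):
    """Return, for each character, the fraction of strings that contain it."""
    n = len(smiles_list)
    if n == 0:
        return {c: 0.0 for c in chars}
    return {c: sum(c in s for s in smiles_list) / n for c in chars}


def presence_gap(positives, negatives, chars=(".", "+")):
    """Difference in presence rate (positives minus negatives) per character."""
    pos = char_presence_rate(positives, chars)
    neg = char_presence_rate(negatives, chars)
    return {c: pos[c] - neg[c] for c in chars}


# Toy strings for illustration only; these are NOT taken from ClinTox.
positives = ["CC(=O)[O-].[Na+]", "C[NH3+].[Cl-]", "CCO"]
negatives = ["CCO", "c1ccccc1", "CC(=O)O", "C[N+](C)(C)C"]

gap = presence_gap(positives, negatives)
for char, diff in gap.items():
    print(f"{char!r}: presence-rate gap = {diff:+.3f}")
```

A large gap for "." (disconnected fragments, as in salt forms) or "+" (charged atoms) would suggest that a string kernel could separate the classes using formatting conventions rather than chemistry, even after canonicalization.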

cyrusmaher avatar Dec 22 '20 00:12 cyrusmaher

Interesting! @henrymoss and I will keep track of this conversation you guys are having in DeepChem!

Ryan-Rhys avatar Dec 22 '20 00:12 Ryan-Rhys

This is interesting indeed!

I wonder if this aligns with the lack of improvement we observed when augmenting the data with extra (non-canonical) SMILES. Basically, we could only learn from training data in canonical form, as our test data was also canonical. Even after increasing the data fivefold (through augmentation), we couldn't improve performance.

henrymoss avatar Dec 22 '20 10:12 henrymoss