
Building a multiclass classification model | Practical Cheminformatics

Open utterances-bot opened this issue 3 years ago • 8 comments

Data cleaning, adding structures to PubChem data, building a multiclass model, dealing with imbalanced data

https://patwalters.github.io/practicalcheminformatics/jupyter/multiclass/pubchem/imbalanced/2021/08/28/multiclass-classification.html

utterances-bot avatar Sep 01 '21 01:09 utterances-bot

I personally prefer bagging with balanced bootstraps over oversampling. But apart from that, cool post.
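For readers unfamiliar with the idea, here is a minimal sketch of bagging with balanced bootstraps. It is not code from the post: the function names, the toy nearest-centroid base learner, and the class labels in the usage note are all assumptions chosen for illustration. Each ensemble member is trained on a bootstrap sample that draws the same number of examples (with replacement) from every class, and predictions are made by majority vote.

```python
import random
from collections import Counter, defaultdict


def balanced_bootstrap(labels, rng):
    """Return indices of one bootstrap sample with equal counts per class.

    Each class contributes as many draws (with replacement) as the
    smallest class has members.
    """
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    n_min = min(len(indices) for indices in by_class.values())
    sample = []
    for indices in by_class.values():
        sample.extend(rng.choices(indices, k=n_min))
    return sample


def fit_nearest_centroid(X, y):
    """Hypothetical base learner: predict the class with the closest centroid."""
    groups = defaultdict(list)
    for row, label in zip(X, y):
        groups[label].append(row)
    centroids = {
        label: [sum(col) / len(rows) for col in zip(*rows)]
        for label, rows in groups.items()
    }

    def predict(x):
        return min(
            centroids,
            key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])),
        )

    return predict


def balanced_bagging_fit(X, y, fit_base, n_estimators=25, seed=0):
    """Train one base model per balanced bootstrap sample."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_estimators):
        idx = balanced_bootstrap(y, rng)
        models.append(fit_base([X[i] for i in idx], [y[i] for i in idx]))
    return models


def balanced_bagging_predict(models, x):
    """Majority vote over the ensemble members."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]
```

Usage would look like `models = balanced_bagging_fit(X, y, fit_nearest_centroid)` followed by `balanced_bagging_predict(models, x)`, where `X` is a list of feature vectors and `y` a list of class labels. In practice one would likely swap the toy base learner for a decision tree or similar; the imbalanced-learn package offers `BalancedBaggingClassifier` as a ready-made option.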

UnixJunkie avatar Sep 01 '21 01:09 UnixJunkie

Are you only interested in a classifier for this dataset? I wonder if a good regressor can be trained from it.

UnixJunkie avatar Sep 01 '21 01:09 UnixJunkie

Thanks for the comments. There's a lot more that I want to do with these datasets, so stay tuned.

PatWalters avatar Sep 01 '21 01:09 PatWalters

The precision for the activator class really took a hit with oversampling. It's true that with the standard approach you hardly get any activator predictions, but when you do, there's a 58% chance the prediction is correct, compared to 27% with oversampling. Of course, there's a large uncertainty in the 58% due to the small sample size.
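To make the sample-size caveat concrete, here is a rough sketch that puts a Wilson score interval around a precision estimate. The counts used in the usage note are hypothetical, picked only to reproduce roughly 58% and 27%; the actual numbers of activator predictions are not given in this thread.

```python
import math


def wilson_interval(true_positives, predicted_positives, z=1.96):
    """Wilson score interval for a proportion such as precision.

    z=1.96 corresponds to an approximate 95% confidence level.
    """
    if predicted_positives <= 0:
        raise ValueError("need at least one positive prediction")
    n = predicted_positives
    p = true_positives / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Hypothetical counts, NOT taken from the post:
# 7 correct out of 12 activator predictions (about 58%)
small_sample = wilson_interval(7, 12)
# 27 correct out of 100 activator predictions (27%)
large_sample = wilson_interval(27, 100)

print("standard:     %.2f to %.2f" % small_sample)
print("oversampling: %.2f to %.2f" % large_sample)
```

With counts like these, the interval around the 58% estimate is much wider than the one around the 27% estimate, which is the point about small sample size. Plugging in the real counts from the confusion matrices would show how far apart the two precisions actually are.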

jhjensen2 avatar Sep 01 '21 07:09 jhjensen2

Good point, I should have gone into the stats a bit more. I'm going to revise the post to include an assessment of the impact on precision and recall.

PatWalters avatar Sep 01 '21 11:09 PatWalters

Hi Pat, thanks for the great post! I always get lots of useful information from your posts and code ;) To tackle imbalanced data, I think it's worth checking out Greg's presentation and blog post: https://www.slideshare.net/GregLandrum1/building-useful-models-for-imbalanced-datasets-without-resampling-166150891 http://rdkit.blogspot.com/2018/11/working-with-unbalanced-data-part-i.html

In real drug discovery projects, we often have imbalanced data, so it's really useful. Thanks!

iwatobipen avatar Sep 01 '21 12:09 iwatobipen

Thanks, Taka! Imbalanced data is an important topic and I plan to talk about it more in future posts. As I mentioned in my reply to Jan, I also need to dig more deeply into the stats.

PatWalters avatar Sep 01 '21 12:09 PatWalters

That sounds nice!!!!!!!

iwatobipen avatar Sep 01 '21 12:09 iwatobipen