practicalcheminformatics
Building a multiclass classification model | Practical Cheminformatics
Data cleaning, adding structures to PubChem data, building a multiclass model, dealing with imbalanced data
I personally prefer bagging with balanced bootstraps over oversampling. But apart from that, cool post.
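For readers unfamiliar with the idea in the comment above, here is a minimal sketch of how a single balanced bootstrap might be drawn. It assumes every class is downsampled (with replacement) to the size of the rarest class; the function name `balanced_bootstrap_indices` and the toy labels are hypothetical and not taken from the post. In a bagging ensemble, one model would be trained per draw and the predictions combined.

```python
import random
from collections import Counter, defaultdict


def balanced_bootstrap_indices(labels, rng):
    """Return row indices for one bootstrap sample with equal class counts.

    Assumption: each class is sampled with replacement down to the size
    of the smallest class, so no class dominates the sample.
    """
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    n_per_class = min(len(indices) for indices in by_class.values())

    sample = []
    for indices in by_class.values():
        # Sampling with replacement is what makes this a bootstrap.
        sample.extend(rng.choices(indices, k=n_per_class))
    rng.shuffle(sample)
    return sample


# Hypothetical, heavily imbalanced toy labels (not the post's data).
toy_labels = ["inactive"] * 90 + ["inhibitor"] * 8 + ["activator"] * 2
rng = random.Random(0)
draw = balanced_bootstrap_indices(toy_labels, rng)
class_counts = Counter(toy_labels[i] for i in draw)
print(class_counts)
```

Each draw sees every class equally often, and repeating the draw many times lets the ensemble still use most of the majority-class data across its members.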
Are you only interested in a classifier for this dataset? I wonder if a good regressor can be trained from it.
Thanks for the comments. There's a lot more that I want to do with these datasets; stay tuned.
Precision for the activator class really took a hit with oversampling. It's true that the standard approach yields hardly any activator predictions, but when it does, there's a 58% chance the prediction is correct, compared with 27% for oversampling. Of course, the 58% carries a large uncertainty because of the small sample size.
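To make the small-sample caveat concrete, one option is to put a confidence interval around each precision value. The sketch below uses the Wilson score interval for a binomial proportion. The counts are assumptions chosen only to roughly match the quoted percentages (7 correct out of 12 predictions is about 58%, 27 out of 100 is 27%); the actual counts are not given in the thread.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is ~95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    z2 = z * z
    denom = 1 + z2 / trials
    center = (p + z2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z2 / (4 * trials**2)) / denom
    return center - half_width, center + half_width


# Hypothetical counts, NOT from the post: (correct predictions, total predictions).
standard_ci = wilson_interval(7, 12)
oversampled_ci = wilson_interval(27, 100)

print("standard:    %.2f to %.2f" % standard_ci)
print("oversampled: %.2f to %.2f" % oversampled_ci)
```

With counts this small, the interval for the standard model is wide enough that the gap between the two precision values should be read cautiously.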
Good point, I should have gone into the stats a bit more. I'm going to revise the post to include an assessment of the impact on precision and recall.
Hi Pat, thanks for the great post! I always get lots of useful information from your posts and code ;) For tackling imbalanced data, I think it's worth checking Greg's presentation and blog post: https://www.slideshare.net/GregLandrum1/building-useful-models-for-imbalanced-datasets-without-resampling-166150891 http://rdkit.blogspot.com/2018/11/working-with-unbalanced-data-part-i.html
In real drug discovery projects, we often have imbalanced data, so it's really useful. Thanks!
Thanks, Taka! Imbalanced data is an important topic and I plan to talk about it more in future posts. As I mentioned in my reply to Jan, I also need to dig more deeply into the stats.
That sounds nice!!!!!!!