imbalanced-learn
New methods
This is a non-exhaustive list of the methods that could be added for the next release.
Oversampling:
- [ ] SPIDER
- [ ] MWMOTE
- [ ] SMOTE-SL
- [ ] SMOTE-RSB
- [x] SMOTE-NC
- [ ] Random-SMOTE https://github.com/scikit-learn-contrib/imbalanced-learn/issues/105#issuecomment-468189349
- [ ] Cluster Based Oversampling https://github.com/scikit-learn-contrib/imbalanced-learn/issues/105#issuecomment-436260357
- [ ] Supervised Over-Sampling https://github.com/scikit-learn-contrib/imbalanced-learn/issues/105#issuecomment-469255114
Prototype Generation/Selection:
- [ ] Steady State Memetic Algorithm (SSMA)
- [ ] Adaptive Self-Generating Prototypes (ASGP)
Ensemble
- [x] Over-Bagging #808
- [x] Under-Bagging #808
- [x] Under-Over-Bagging #808
- [x] SMOTE-Bagging #808
- [x] RUS-Boost
- [ ] SMOTE-Boost
- [ ] RAMO-Boost
- [ ] EUS-Boost
Regression
- [ ] SMOTE for regression
P. Branco, L. Torgo and R. Ribeiro (2016). A Survey of Predictive Modeling on Imbalanced Domains. ACM Comput. Surv. 49, 2, 31. DOI: http://dx.doi.org/10.1145/2907070
Branco, P., Torgo, L. and Ribeiro, R.P. (2017). "Pre-processing Approaches for Imbalanced Distributions in Regression." Special Issue on Learning in the Presence of Class Imbalance and Concept Drift, Neurocomputing Journal. (submitted).
@dvro @chkoar you can add anything there. We can make a PR to add this stuff to the todo list.
We should also discuss where these methods will be added (under-/over-sampling or a new module).
SGP should be placed in a new module/package, like in scikit-protopy. `generation` is a reasonable name for this kind of algorithm.
@chkoar What would be the reason to disassociate `over-sampling` and `generation`?
Actually none. Just for semantic reasons. Obviously, prototype generation methods could be considered as over-sampling methods.
@glemaitre actually, oversampling is different than prototype generation:
- Prototype Selection: given a set of samples S, a PS method selects a subset S', where S' ⊆ S and |S'| < |S|.
- Prototype Generation: given a set of samples S, a PG method generates a new set S', where |S'| < |S|.
- Oversampling: given a set of samples S, an OS method generates a new set S', where |S'| > |S| and S ⊆ S'.
Thanks for the clarification @dvro. That could be placed in the wiki!
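To make the distinction concrete, here is a small illustration using estimators that already exist in imbalanced-learn (the exact resampled sizes depend on the data and on each estimator's defaults):

```python
from sklearn.datasets import make_classification
from imblearn.under_sampling import CondensedNearestNeighbour, ClusterCentroids
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# Prototype selection: keeps a subset of the original samples (S' ⊆ S, |S'| < |S|)
X_ps, y_ps = CondensedNearestNeighbour(random_state=0).fit_resample(X, y)

# Prototype generation: creates new prototypes not present in S (|S'| < |S|)
X_pg, y_pg = ClusterCentroids(random_state=0).fit_resample(X, y)

# Over-sampling: adds synthetic samples to the original set (S ⊆ S', |S'| > |S|)
X_os, y_os = SMOTE(random_state=0).fit_resample(X, y)

print(X.shape, X_ps.shape, X_pg.shape, X_os.shape)
```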
Hi,
If by SPIDER you mean the algorithms from "Selective Pre-processing of Imbalanced Data for Improving Classification Performance" and "Learning from imbalanced data in presence of noisy and borderline examples", maybe I could be of some help. I know the authors, and perhaps I could implement a Python version of this algorithm under their "supervision"? That might be "safer" than relying only on the pseudo-code from the conference papers.
Yes, it is that article. We would be happy to have a PR on that. We are going to organize a sprint at some point to develop some of the above methods.
The only important thing is to follow the scikit-learn conventions regarding the estimator, but this is something that we will also take care of during review.
MetaCost could be a nice addition.
Yep. You can add it to the list above.
Hi, I hope this is a good place to write about it: I have an implementation of Roughly Balanced Bagging (an under-bagging method) with an extension for multiclass problems (based on this article), written a few months ago as an extension of the bagging class from sklearn. I will gladly polish this implementation to match this package's conventions for bagging classifiers and make a pull request if you are interested in such a contribution.
@mwydmuch PRs are always welcome. With the addition of #360 we will start the ensemble methods module, and I think that we'll deprecate the current ensemble-based samplers.
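For readers landing here, a minimal sketch of the core sampling idea behind Roughly Balanced Bagging (Hido et al.), not @mwydmuch's implementation; the function name and defaults are illustrative only:

```python
import numpy as np

def roughly_balanced_bags(y, n_bags=10, random_state=0):
    """Yield per-bag sample indices following the Roughly Balanced Bagging
    idea: keep the minority size fixed and draw the majority size from a
    negative binomial distribution."""
    rng = np.random.RandomState(random_state)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    majority = classes[np.argmax(counts)]
    n_min = counts.min()
    min_idx = np.flatnonzero(y == minority)
    maj_idx = np.flatnonzero(y == majority)
    for _ in range(n_bags):
        # E[n_maj] == n_min when p=0.5, so each bag is only *roughly* balanced.
        n_maj = max(rng.negative_binomial(n_min, 0.5), 1)
        yield np.concatenate([
            rng.choice(min_idx, n_min, replace=True),
            rng.choice(maj_idx, n_maj, replace=True),
        ])
```

Each yielded index array could then be used to fit one estimator of the ensemble.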
@glemaitre do you think that we should have requirements, e.g. number of citations, before we merge an implementation into the package?
I would say no. This is something that scikit-learn does, but the contribs are here to give some freedom in that regard and host bleeding-edge estimators. I would just require that the estimator show some advantage on a benchmark, so that we can explain to users when to use it.
@glemaitre I was thinking of asking @mwydmuch to include a comparison with the BalancedBaggingClassifier (#360), but I thought that would be a nice addition after the implementation, not a requirement. I think that we are on the same page here. Apart from that, we actually have requirements like the dependencies, right?
Yes, regarding the dependencies, we are limiting ourselves to numpy/scipy/scikit-learn. Then we can see if we can vendor code, but it should be avoided as much as possible.
Regarding the comparison, that is partly my point in making a benchmark. I need to fix #360, in fact :)
Thank you for the comments, I will look into #360 then. And I can also prepare a comparison between these methods :)
@glemaitre I would be interested in adding RUSBoost to the package. Would it be fine if we inherit the code from AdaBoost, since RUSBoost is similar to AdaBoost except for small changes in the training step?
@souravsingh it looks like it.
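For reference, a rough sketch of that idea for the binary case: AdaBoost's weight update, with each weak learner trained on a randomly under-sampled (balanced) subset. This is not the eventual imbalanced-learn implementation; the helper name is made up, and `sample_indices_` assumes a recent RandomUnderSampler version:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from imblearn.under_sampling import RandomUnderSampler

def rusboost_fit(X, y, n_estimators=10, random_state=0):
    rng = np.random.RandomState(random_state)
    w = np.full(len(y), 1.0 / len(y))  # AdaBoost sample weights
    estimators, alphas = [], []
    for _ in range(n_estimators):
        # RUSBoost twist: balance the training set before fitting each learner.
        rus = RandomUnderSampler(random_state=rng.randint(2**31 - 1))
        X_res, y_res = rus.fit_resample(X, y)
        idx = rus.sample_indices_  # which original samples were kept
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X_res, y_res, sample_weight=w[idx])
        # The weight update is plain (discrete) AdaBoost on the *full* set.
        incorrect = stump.predict(X) != y
        err = np.average(incorrect, weights=w)
        if err >= 0.5:  # weak learner no better than chance
            break
        alpha = 0.5 * np.log((1.0 - err) / max(err, 1e-10))
        w *= np.exp(alpha * incorrect)  # up-weight misclassified samples
        w /= w.sum()
        estimators.append(stump)
        alphas.append(alpha)
    return estimators, alphas
```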
@glemaitre Hi, I've worked on class-imbalance problems in the past. Over-sampling/under-sampling is too costly for big-data problems. In my case, I tried to train an oversampler on a dataset of size 200,000 × 1,400; it ran out of memory on a PC with 32 GB of RAM even though I was using a pandas sparse dataframe. Therefore, I would suggest categorizing the algorithms into two categories: 1. algorithms for small datasets, and 2. algorithms for big datasets. In the second category, we could implement approaches such as cost-sensitive learning and distributed and online methods for class-imbalance learning. This would make the API general-purpose, suitable for small-scale as well as large-scale datasets.
@chandu8542 I don't especially agree with your categorization, which is based more on an engineering point of view than on a "method" point of view. For instance, it would not make sense to put the SVM classifier in a small-scale-problem category; the important thing is that the SVM classifier should live under an SVM module.
However, your comments are useful and should be used to improve the user guide/docstring of the different methods.
Regarding the cost-sensitive methods, we need to implement some and they would be useful for sure :)
@glemaitre The "method" point of view is embedded within the engineering point of view, if you take a bird's-eye view. This would ease the task of a practitioner in choosing the right method for the problem at hand. That should be the sole purpose of open-source libraries: to make them usable for the research community as well as practitioners; otherwise people (researchers/practitioners) will find it difficult to choose the right method when they have several methods under one umbrella. The rest is up to you.
I think that everyone is right in this discussion. However, I agree with @glemaitre that the main indexing should be by method type, not characteristics. But it would be useful to include @chandu8542's criteria in the benchmarking, to see how all the algorithms perform in terms of memory, speed, etc., using datasets of different sizes. Of course, such a benchmark should come with narrative documentation to guide the user's choice of method. As always, PRs are welcome. We would gladly put our time into reviewing such a PR so that nobody ever again faces the same troubles.
Cluster Based Oversampling
- Jo, T., & Japkowicz, N. (2004). Class imbalances versus small disjuncts. ACM SIGKDD Explorations Newsletter, 6(1), 40-49.
Random-SMOTE
- Dong, Y., & Wang, X. (2011). A new over-sampling approach: random-SMOTE for learning from imbalanced data sets. In International Conference on Knowledge Science, Engineering and Management.
Supervised Over-Sampling
- Hu, J., He, X., Yu, D. J., Yang, X. B., Yang, J. Y., & Shen, H. B. (2014). A new supervised over-sampling algorithm with application to protein-nucleotide binding residue prediction. PLoS ONE, 9(9), e107676.
A new one:
Sharifirad, S., Nazari, A., & Ghatee, M. (2018). Modified SMOTE using mutual information and different sorts of entropies. arXiv preprint arXiv:1803.11002.
Includes MIESMOTE, MAESMOTE, RESMOTE and TESMOTE.
Since SMOTE is mostly a meta-algorithm for interpolating new samples, with an interpolation strategy that changes from author to author, would it be possible to implement a generic SMOTE model where the user can provide a custom function to build their own version of SMOTE? This might also ease the writing (and contribution) of new SMOTE variants.
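Something along these lines, perhaps; a sketch of what such a pluggable interface could look like. `generic_smote` and its signature are hypothetical, not an existing imbalanced-learn API:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def linear_interpolation(sample, neighbor, rng):
    # The classic SMOTE strategy: a random point on the segment between them.
    return sample + rng.uniform() * (neighbor - sample)

def generic_smote(X_minority, n_new, interpolate=linear_interpolation,
                  k_neighbors=5, random_state=0):
    """Generate n_new synthetic minority samples; the interpolation
    strategy is fully delegated to the `interpolate` callable."""
    X_minority = np.asarray(X_minority)
    rng = np.random.RandomState(random_state)
    nn = NearestNeighbors(n_neighbors=k_neighbors + 1).fit(X_minority)
    _, neigh_idx = nn.kneighbors(X_minority)
    new_samples = []
    for _ in range(n_new):
        i = rng.randint(len(X_minority))
        j = rng.choice(neigh_idx[i][1:])  # skip the sample itself
        new_samples.append(interpolate(X_minority[i], X_minority[j], rng))
    return np.vstack(new_samples)
```

A new SMOTE variant would then only need to supply its own `interpolate` function instead of re-implementing the neighbour search.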
Hi Marek, could you kindly share with me your Python implementation of Roughly Balanced Bagging? I would be grateful for your help.
Thank you.
Haleem
Hello,
I am writing because in the use case I am currently working on, we would love to have a certain oversampling feature, yet it is not implemented anywhere. Therefore, I would like to propose it here.
We are building an NLP model for binary classification, where one of the classes is strongly imbalanced. One of the approaches would be to oversample using data augmentation techniques for NLP, e.g. using the nlpaug library to replace some words with synonyms. Having a class in the library that allows packaging the augmentation into an sklearn pipeline would be great! I can also see this being used in computer vision.
Let me know what you think! Could this become one of the features of this library? In that case I would love to contribute. If it doesn't fit into this library, do you know of any other open-source project where it would fit?
Cheers, Mateusz
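One way this could work today is via `FunctionSampler`, which already lets an arbitrary resampling function participate in an imblearn `Pipeline`. A minimal sketch, where `synonym_replace` is a hypothetical placeholder for an nlpaug call and `validate=False` assumes a recent imbalanced-learn version:

```python
import numpy as np
from imblearn import FunctionSampler
from imblearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def synonym_replace(text):
    # Placeholder: call into nlpaug (or any augmentation backend) here.
    return text

def augment_minority(X, y):
    # Oversample the minority class by generating augmented copies.
    X, y = np.asarray(X, dtype=object), np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    idx = np.random.choice(np.flatnonzero(y == minority),
                           counts.max() - counts.min())
    X_new = np.array([synonym_replace(t) for t in X[idx]], dtype=object)
    return np.concatenate([X, X_new]), np.concatenate([y, y[idx]])

pipe = Pipeline([
    # validate=False lets the sampler receive raw text instead of a 2D array.
    ("augment", FunctionSampler(func=augment_minority, validate=False)),
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression()),
])
```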
Not sure if this is the right place, but for my work I implemented a custom version of SMOTE for Regression as described in this paper:
Torgo L., Ribeiro R.P., Pfahringer B., Branco P. (2013) SMOTE for Regression. In: Correia L., Reis L.P., Cascalho J. (eds) Progress in Artificial Intelligence. EPIA 2013. Lecture Notes in Computer Science, vol 8154. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40669-0_33
As mentioned in the original post, it would be nice to get SMOTE for Regression into imbalanced-learn.
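For readers interested in the interpolation step described in that paper, here is a minimal sketch of the SMOTER idea: features are interpolated as in SMOTE, and the target is a distance-weighted average of the two seed targets. The relevance function that selects which "rare" cases to over-sample is omitted, and all names are illustrative:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smoter_interpolate(X_rare, y_rare, n_new, k_neighbors=5, random_state=0):
    """Generate n_new synthetic (x, y) pairs from the "rare" cases."""
    X_rare, y_rare = np.asarray(X_rare), np.asarray(y_rare)
    rng = np.random.RandomState(random_state)
    nn = NearestNeighbors(n_neighbors=k_neighbors + 1).fit(X_rare)
    _, neigh_idx = nn.kneighbors(X_rare)
    X_new, y_new = [], []
    for _ in range(n_new):
        i = rng.randint(len(X_rare))
        j = rng.choice(neigh_idx[i][1:])  # skip the sample itself
        x = X_rare[i] + rng.uniform() * (X_rare[j] - X_rare[i])
        # Target: average of the two seed targets, weighted inversely
        # by the new point's distance to each seed.
        d1 = np.linalg.norm(x - X_rare[i])
        d2 = np.linalg.norm(x - X_rare[j])
        w = d2 / (d1 + d2) if (d1 + d2) > 0 else 0.5
        X_new.append(x)
        y_new.append(w * y_rare[i] + (1 - w) * y_rare[j])
    return np.asarray(X_new), np.asarray(y_new)
```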