imbalanced-learn
New methods
This is a non-exhaustive list of the methods that could be added for the next release.
Oversampling:
- [ ] SPIDER
- [ ] MWMOTE
- [ ] SMOTE-SL
- [ ] SMOTE-RSB
- [x] SMOTE-NC
- [ ] Random-SMOTE https://github.com/scikit-learn-contrib/imbalanced-learn/issues/105#issuecomment-468189349
- [ ] Cluster Based Oversampling https://github.com/scikit-learn-contrib/imbalanced-learn/issues/105#issuecomment-436260357
- [ ] Supervised Over-Sampling https://github.com/scikit-learn-contrib/imbalanced-learn/issues/105#issuecomment-469255114
Prototype Generation/Selection:
- [ ] Steady State Memetic Algorithm (SSMA)
- [ ] Adaptive Self-Generating Prototypes (ASGP)
Ensemble
- [x] Over-Bagging #808
- [x] Under-Bagging #808
- [x] Under-Over-Bagging #808
- [x] SMOTE-Bagging #808
- [x] RUS-Boost
- [ ] SMOTE-Boost
- [ ] RAMO-Boost
- [ ] EUS-Boost
Regression
- [ ] SMOTE for regression
P. Branco, L. Torgo and R. Ribeiro (2016). A Survey of Predictive Modeling on Imbalanced Domains. ACM Comput. Surv. 49, 2, 31. DOI: http://dx.doi.org/10.1145/2907070
Branco, P., Torgo, L. and Ribeiro, R.P. (2017). "Pre-processing Approaches for Imbalanced Distributions in Regression." Special Issue on Learning in the Presence of Class Imbalance and Concept Drift, Neurocomputing Journal. (submitted).
@dvro @chkoar you can add anything there. We can make a PR to add this stuff to the todo list.
We should also discuss where these methods will be added (under-/over-sampling or a new module).
SGP should be placed in a new module/package, like in scikit-protopy. `generation` is a reasonable name for this kind of algorithm.
@chkoar What would be the reason to disassociate `over-sampling` and `generation`?
Actually none. Just for semantic reasons. Obviously, prototype generation methods could be considered as over-sampling methods.
@glemaitre actually, oversampling is different than prototype generation:
- Prototype Selection: given a set of samples S, a PS method selects a subset S', where S' ⊆ S and |S'| < |S|.
- Prototype Generation: given a set of samples S, a PG method generates a new set S', where |S'| < |S|.
- Oversampling: given a set of samples S, an OS method generates a new set S', where |S'| > |S| and S ⊆ S'.
Thanks for the clarification @dvro. That could be placed in the wiki!
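To make the distinction concrete, here is a small illustration using estimators that already exist in imbalanced-learn (the exact resampled sizes depend on the data and on each estimator's defaults):

```python
from sklearn.datasets import make_classification
from imblearn.under_sampling import CondensedNearestNeighbour, ClusterCentroids
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# Prototype selection: keeps a subset of the original samples (S' ⊆ S, |S'| < |S|)
X_ps, y_ps = CondensedNearestNeighbour(random_state=0).fit_resample(X, y)

# Prototype generation: creates new prototypes not present in S (|S'| < |S|)
X_pg, y_pg = ClusterCentroids(random_state=0).fit_resample(X, y)

# Over-sampling: adds synthetic samples to the original set (S ⊆ S', |S'| > |S|)
X_os, y_os = SMOTE(random_state=0).fit_resample(X, y)

print(X.shape, X_ps.shape, X_pg.shape, X_os.shape)
```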
Hi,
If by SPIDER you mean the algorithms from "Selective Pre-processing of Imbalanced Data for Improving Classification Performance" and "Learning from imbalanced data in presence of noisy and borderline examples", maybe I could be of some help. I know the authors, and perhaps I could implement a Python version of this algorithm under their "supervision"? That might be "safer" than relying only on the pseudo-code from the conference papers.
Yes, it is that article. We would be happy to have a PR on that. We are going to organize a sprint at some point to develop some of the above methods.
The only important thing is to follow the scikit-learn conventions regarding the estimator, but this is something that we will also take care of during review.
MetaCost could be a nice addition.
Yep. You can add it to the list above.
Hi, I hope this is a good place to write about it: I have an implementation of Roughly Balanced Bagging (an under-bagging method) with an extension for multiclass problems (based on this article), written a few months ago as an extension of the bagging class from sklearn. I will gladly polish this implementation to match this package's conventions for bagging classifiers and make a pull request if you are interested in such a contribution.
@mwydmuch PRs are always welcome. With the addition of #360 we will start the ensemble methods module, and I think that we'll deprecate the current ensemble-based samplers.
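For readers landing here, a minimal sketch of the core sampling idea behind Roughly Balanced Bagging (Hido et al.), not @mwydmuch's implementation; the function name and defaults are illustrative only:

```python
import numpy as np

def roughly_balanced_bags(y, n_bags=10, random_state=0):
    """Yield per-bag sample indices following the Roughly Balanced Bagging
    idea: keep the minority size fixed and draw the majority size from a
    negative binomial distribution."""
    rng = np.random.RandomState(random_state)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    majority = classes[np.argmax(counts)]
    n_min = counts.min()
    min_idx = np.flatnonzero(y == minority)
    maj_idx = np.flatnonzero(y == majority)
    for _ in range(n_bags):
        # E[n_maj] == n_min when p=0.5, so each bag is only *roughly* balanced.
        n_maj = max(rng.negative_binomial(n_min, 0.5), 1)
        yield np.concatenate([
            rng.choice(min_idx, n_min, replace=True),
            rng.choice(maj_idx, n_maj, replace=True),
        ])
```

Each yielded index array could then be used to fit one estimator of the ensemble.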
@glemaitre do you think that we should have requirements, e.g. number of citations, before we merge an implementation into the package?
I would say no. This is something that scikit-learn does, but the contribs are here to give some freedom in that regard and host bleeding-edge estimators. I would just require that the estimator show some advantage on a benchmark, so that we can explain to users when to use it.
@glemaitre I was thinking of asking @mwydmuch to include a comparison with the BalancedBaggingClassifier (#360), but I thought that would be a nice addition after the implementation, not a requirement. I think that we are on the same page here. Apart from that, we actually have requirements like the dependencies, right?
Yes, regarding the dependencies, we are limiting ourselves to numpy/scipy/scikit-learn. Then we can see if we can vendor code, but it should be avoided as much as possible.
Regarding the comparison, that is partly my point in making a benchmark. I need to fix #360, in fact :)
Thank you for the comments, I will look into #360 then. And I can also prepare a comparison between these methods :)
@glemaitre I would be interested in adding RUSBoost to the package. Would it be fine if we inherit the code from AdaBoost, since RUSBoost is similar to AdaBoost except for small changes in the training step?
@souravsingh it looks like it.
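For reference, a rough sketch of that idea for the binary case: AdaBoost's weight update, with each weak learner trained on a randomly under-sampled (balanced) subset. This is not the eventual imbalanced-learn implementation; the helper name is made up, and `sample_indices_` assumes a recent RandomUnderSampler version:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from imblearn.under_sampling import RandomUnderSampler

def rusboost_fit(X, y, n_estimators=10, random_state=0):
    rng = np.random.RandomState(random_state)
    w = np.full(len(y), 1.0 / len(y))  # AdaBoost sample weights
    estimators, alphas = [], []
    for _ in range(n_estimators):
        # RUSBoost twist: balance the training set before fitting each learner.
        rus = RandomUnderSampler(random_state=rng.randint(2**31 - 1))
        X_res, y_res = rus.fit_resample(X, y)
        idx = rus.sample_indices_  # which original samples were kept
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X_res, y_res, sample_weight=w[idx])
        # The weight update is plain (discrete) AdaBoost on the *full* set.
        incorrect = stump.predict(X) != y
        err = np.average(incorrect, weights=w)
        if err >= 0.5:  # weak learner no better than chance
            break
        alpha = 0.5 * np.log((1.0 - err) / max(err, 1e-10))
        w *= np.exp(alpha * incorrect)  # up-weight misclassified samples
        w /= w.sum()
        estimators.append(stump)
        alphas.append(alpha)
    return estimators, alphas
```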
@glemaitre Hi, I've worked on class-imbalance problems in the past. Over-sampling/under-sampling is too costly for big-data problems. In my case, I tried to train an oversampler on a dataset of size 200,000 × 1,400; it ran out of memory on a PC with 32 GB of RAM even though I was using a pandas sparse dataframe. Therefore, I would suggest categorizing the algorithms into two categories: 1. algorithms for small datasets, and 2. algorithms for big datasets. In the second category, we could implement approaches such as cost-sensitive learning and distributed and online methods for class-imbalance learning. This would make the API general-purpose, suitable for small-scale as well as large-scale datasets.
@chandu8542 I don't especially agree with your categorization, which is based more on an engineering point of view than on a "method" point of view. For instance, it would not make sense to put the SVM classifier in a small-scale-problem category; the important thing is that the SVM classifier should live under an SVM module.
However, your comments are useful and should be used to improve the user guide/docstring of the different methods.
Regarding the cost-sensitive methods, we need to implement some and they would be useful for sure :)
@glemaitre The "method" point of view is embedded within the engineering point of view, if you take a bird's-eye view. This would ease the task of a practitioner in choosing the right method for the problem at hand. That should be the sole purpose of open-source libraries: to make them usable for the research community as well as practitioners; otherwise people (researchers/practitioners) will find it difficult to choose the right method when they have several methods under one umbrella. The rest is up to you.
I think that everyone is right in this discussion. However, I agree with @glemaitre that the main indexing should be by method type, not characteristics. But it would be useful to include @chandu8542's criteria in the benchmarking, to see how all the algorithms perform in terms of memory, speed, etc., using datasets of different sizes. Of course, such a benchmark should come with narrative documentation to guide the user's choice of method. As always, PRs are welcome. We would gladly put our time into reviewing such a PR so that nobody ever again faces the same troubles.
Cluster Based Oversampling
- Jo, T., & Japkowicz, N. (2004). Class imbalances versus small disjuncts. ACM SIGKDD Explorations Newsletter, 6(1), 40-49.
Random-SMOTE
- Dong, Y., & Wang, X. (2011). A new over-sampling approach: random-SMOTE for learning from imbalanced data sets. In International Conference on Knowledge Science, Engineering and Management.
Supervised Over-Sampling
- Hu, J., He, X., Yu, D. J., Yang, X. B., Yang, J. Y., & Shen, H. B. (2014). A new supervised over-sampling algorithm with application to protein-nucleotide binding residue prediction. PLoS ONE, 9(9), e107676.
A new one:
Sharifirad, S., Nazari, A., & Ghatee, M. (2018). Modified SMOTE using mutual information and different sorts of entropies. arXiv preprint arXiv:1803.11002.
Includes MIESMOTE, MAESMOTE, RESMOTE and TESMOTE.
Since SMOTE is mostly a meta-algorithm for interpolating new samples, with an interpolation strategy that changes from author to author, would it be possible to implement a generic SMOTE model where the user can provide a custom function to build their own version of SMOTE? This might also ease the writing (and contribution) of new SMOTE variants.
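Something along these lines, perhaps; a sketch of what such a pluggable interface could look like. `generic_smote` and its signature are hypothetical, not an existing imbalanced-learn API:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def linear_interpolation(sample, neighbor, rng):
    # The classic SMOTE strategy: a random point on the segment between them.
    return sample + rng.uniform() * (neighbor - sample)

def generic_smote(X_minority, n_new, interpolate=linear_interpolation,
                  k_neighbors=5, random_state=0):
    """Generate n_new synthetic minority samples; the interpolation
    strategy is fully delegated to the `interpolate` callable."""
    X_minority = np.asarray(X_minority)
    rng = np.random.RandomState(random_state)
    nn = NearestNeighbors(n_neighbors=k_neighbors + 1).fit(X_minority)
    _, neigh_idx = nn.kneighbors(X_minority)
    new_samples = []
    for _ in range(n_new):
        i = rng.randint(len(X_minority))
        j = rng.choice(neigh_idx[i][1:])  # skip the sample itself
        new_samples.append(interpolate(X_minority[i], X_minority[j], rng))
    return np.vstack(new_samples)
```

A new SMOTE variant would then only need to supply its own `interpolate` function instead of re-implementing the neighbour search.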
Hi Marek, could you kindly share with me your Python implementation of Roughly Balanced Bagging? I would be grateful for your help.
Thank you.
Haleem
Hello,
I am writing because in the use case I am currently working on, we would love to have a certain oversampling feature, yet it is not implemented anywhere. Therefore, I would like to propose it here.
We are building an NLP model for binary classification, where one of the classes is strongly imbalanced. One of the approaches would be to oversample using data augmentation techniques for NLP, e.g. using the nlpaug library to replace some words with synonyms. Having a class in the library that allows packaging the augmentation into an sklearn pipeline would be great! I can also see this being used in computer vision.
Let me know what you think! Could this become one of the features of this library? In that case I would love to contribute. If it doesn't fit into this library, do you know of any other open-source project where it would fit?
Cheers, Mateusz
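One way this could work today is via `FunctionSampler`, which already lets an arbitrary resampling function participate in an imblearn `Pipeline`. A minimal sketch, where `synonym_replace` is a hypothetical placeholder for an nlpaug call and `validate=False` assumes a recent imbalanced-learn version:

```python
import numpy as np
from imblearn import FunctionSampler
from imblearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def synonym_replace(text):
    # Placeholder: call into nlpaug (or any augmentation backend) here.
    return text

def augment_minority(X, y):
    # Oversample the minority class by generating augmented copies.
    X, y = np.asarray(X, dtype=object), np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    idx = np.random.choice(np.flatnonzero(y == minority),
                           counts.max() - counts.min())
    X_new = np.array([synonym_replace(t) for t in X[idx]], dtype=object)
    return np.concatenate([X, X_new]), np.concatenate([y, y[idx]])

pipe = Pipeline([
    # validate=False lets the sampler receive raw text instead of a 2D array.
    ("augment", FunctionSampler(func=augment_minority, validate=False)),
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression()),
])
```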
Not sure if this is the right place, but for my work I implemented a custom version of SMOTE for Regression as described in this paper:
Torgo L., Ribeiro R.P., Pfahringer B., Branco P. (2013) SMOTE for Regression. In: Correia L., Reis L.P., Cascalho J. (eds) Progress in Artificial Intelligence. EPIA 2013. Lecture Notes in Computer Science, vol 8154. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40669-0_33
As mentioned in the original post, it would be nice to get SMOTE for Regression into imbalanced-learn.
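For readers interested in the interpolation step described in that paper, here is a minimal sketch of the SMOTER idea: features are interpolated as in SMOTE, and the target is a distance-weighted average of the two seed targets. The relevance function that selects which "rare" cases to over-sample is omitted, and all names are illustrative:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smoter_interpolate(X_rare, y_rare, n_new, k_neighbors=5, random_state=0):
    """Generate n_new synthetic (x, y) pairs from the "rare" cases."""
    X_rare, y_rare = np.asarray(X_rare), np.asarray(y_rare)
    rng = np.random.RandomState(random_state)
    nn = NearestNeighbors(n_neighbors=k_neighbors + 1).fit(X_rare)
    _, neigh_idx = nn.kneighbors(X_rare)
    X_new, y_new = [], []
    for _ in range(n_new):
        i = rng.randint(len(X_rare))
        j = rng.choice(neigh_idx[i][1:])  # skip the sample itself
        x = X_rare[i] + rng.uniform() * (X_rare[j] - X_rare[i])
        # Target: average of the two seed targets, weighted inversely
        # by the new point's distance to each seed.
        d1 = np.linalg.norm(x - X_rare[i])
        d2 = np.linalg.norm(x - X_rare[j])
        w = d2 / (d1 + d2) if (d1 + d2) > 0 else 0.5
        X_new.append(x)
        y_new.append(w * y_rare[i] + (1 - w) * y_rare[j])
    return np.asarray(X_new), np.asarray(y_new)
```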