Add method for regression

Open chjq201410695 opened this issue 5 years ago • 13 comments

As the title says. I found an implementation in R: https://github.com/paobranco/Pre-processingApproachesImbalanceRegression

and the corresponding paper: https://www.semanticscholar.org/paper/SMOTE-for-Regression-Torgo-Ribeiro/43cda672b9ac0833086e19c90d42c2c0fbc361c6

chjq201410695 avatar May 18 '19 03:05 chjq201410695

I am not opposed to it.

glemaitre avatar Jun 07 '19 12:06 glemaitre

closing in favor of #105

glemaitre avatar Jun 11 '19 22:06 glemaitre

Hi @glemaitre am I right that currently only BalancedRandomForestClassifier from imblearn.ensemble can take real numbers as y for regression problems? Other ensemble models such as RUSBoostClassifier cannot do this? The oversampling strategies cannot do this either?

Thanks!

bwang482 avatar Jul 30 '19 12:07 bwang482

> Hi @glemaitre am I right that currently only BalancedRandomForestClassifier from imblearn.ensemble can take real numbers as y for regression problems? Other ensemble models such as RUSBoostClassifier cannot do this?

@bluemonk482 the names of the models you mentioned end with Classifier. That implies that they are applicable to classification tasks.

> The oversampling strategies cannot do this either?

Currently no, but we are interested in including an implementation of such a method.

chkoar avatar Jul 30 '19 12:07 chkoar

Thanks @chkoar !

I assume it is more complex than simply changing class BalancedRandomForestClassifier(RandomForestClassifier) to class BalancedRandomForestClassifier(RandomForestRegressor) in https://github.com/scikit-learn-contrib/imbalanced-learn/blob/c0aa81c40173bd28b863ccc1b82bbafcacb240c4/imblearn/ensemble/_forest.py ???

bwang482 avatar Jul 30 '19 13:07 bwang482

Yes, because you need to understand and design a proper resampling strategy in the context of regression, which is not really straightforward, and there is almost no literature on this.

glemaitre avatar Jul 30 '19 13:07 glemaitre

Understood. Thanks @glemaitre !

bwang482 avatar Jul 30 '19 13:07 bwang482

@glemaitre this thread is such a godsend for me! So, I understand there is presently no way to generate synthetic data for regression problems, where the output variable Y is a continuous value. Is that correct? Can the expert machine learners here suggest some way out of this sort of problem? More details are in my post: https://stats.stackexchange.com/questions/433740/regression-on-unevenly-distributed-high-dimensional-dataset

akatav avatar Oct 30 '19 07:10 akatav

I am reopening this issue. We could make a generic tool that quantizes the target so that any sampler can be applied. We could think about a meta-estimator to do the job. This would require what is called a relevance function.
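To make the idea concrete, here is a minimal sketch of a relevance function used to quantize a continuous target into pseudo-classes. The names `relevance` and `quantize_target`, the distance-from-median scoring, and the 0.5 threshold are all assumptions made for illustration; they are not part of imbalanced-learn and are not the relevance function defined in the SMOTE for regression paper.

```python
import numpy as np


def relevance(y):
    """Hypothetical relevance function (an assumption, not a library API).

    Scores each target in [0, 1]: 0 at the median, 1 at the target that
    lies farthest from the median.
    """
    y = np.asarray(y, dtype=float)
    dev = np.abs(y - np.median(y))
    max_dev = dev.max()
    if max_dev == 0:
        # Constant target: nothing is rare.
        return np.zeros_like(dev)
    return dev / max_dev


def quantize_target(y, threshold=0.5):
    """Map a continuous target to pseudo-classes: 0 (common) or 1 (rare)."""
    return (relevance(y) >= threshold).astype(int)


# Toy target with one extreme value.
y = np.array([1.0, 1.1, 0.9, 1.2, 1.0, 0.8, 5.0])
labels = quantize_target(y)
print(labels)
```

Once the target is expressed as pseudo-classes, a classification sampler could in principle be fitted on those labels, with the original continuous values carried along or reconstructed afterwards.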

glemaitre avatar Nov 17 '19 11:11 glemaitre

I believe these are relevant for this issue:

  • Torgo, Luís, et al. "Smote for regression." Portuguese conference on artificial intelligence. Springer, Berlin, Heidelberg, 2013.

  • Torgo, Luís, et al. "Resampling strategies for regression." Expert Systems 32.3 (2015): 465-476.

  • Branco, Paula. "Re-sampling approaches for regression tasks under imbalanced domains." Unpublished Master's thesis, Dep. Computer Science, Faculty of Sciences, University of Porto (2014).

  • Branco, Paula Oliveira, Luís Torgo, and Rita Paula Ribeiro. "SMOGN: a pre-processing approach for imbalanced regression." (2017).

ogencoglu avatar Jan 24 '20 06:01 ogencoglu

https://github.com/paobranco

She wrote several papers on the topic and has some of them implemented in R.

glevv avatar Feb 10 '21 18:02 glevv

I think the simplest way to do it without adding new methods is to discretize the target (uniformly or with k-means; quantiles won't do), then fit an oversampler on the bin labels, and then apply an inverse transform (assign midrange bin values instead of bin numbers).

It should work through Pipeline and TargetTransformer.
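A rough sketch of those three steps, using only NumPy. The function name `oversample_regression` is hypothetical, and the manual random duplication below is a stand-in for a real sampler; an imbalanced-learn sampler such as RandomOverSampler could presumably be fitted on the bin labels instead.

```python
import numpy as np


def oversample_regression(X, y, n_bins=4, random_state=0):
    """Illustrative helper (hypothetical name, not an imbalanced-learn API)."""
    rng = np.random.default_rng(random_state)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)

    # Step 1: discretize the target into uniform-width bins.
    edges = np.linspace(y.min(), y.max(), n_bins + 1)
    # Using interior edges only keeps bin ids within 0..n_bins-1.
    bins = np.digitize(y, edges[1:-1])
    midpoints = (edges[:-1] + edges[1:]) / 2

    # Step 2: randomly duplicate samples until every non-empty bin
    # has as many samples as the largest bin.
    counts = np.bincount(bins, minlength=n_bins)
    target = counts.max()
    chosen = []
    for b in np.flatnonzero(counts):
        members = np.flatnonzero(bins == b)
        extra = rng.choice(members, size=target - counts[b], replace=True)
        chosen.append(np.concatenate([members, extra]))
    idx = np.concatenate(chosen)

    # Step 3: inverse transform, assigning each sample its bin midpoint.
    return X[idx], midpoints[bins[idx]]


# Toy data: six targets near 1.0 and a single extreme target.
X = np.arange(7).reshape(-1, 1)
y = np.array([1.0, 1.1, 0.9, 1.2, 1.0, 0.8, 5.0])
X_res, y_res = oversample_regression(X, y)
print(X_res.shape, np.unique(y_res))
```

Note that the inverse transform replaces every target with its bin midpoint, so resolution inside a bin is lost, and empty bins stay empty because there is nothing to duplicate.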

glevv avatar Apr 28 '21 10:04 glevv

I also vote for SMOTER. I don't want to have to download a different package https://pypi.org/project/smogn/ to do SMOTE with regression problems.

pavelkomarov avatar Jul 12 '21 20:07 pavelkomarov