imbalanced-learn
Add method for regression
As the title says. I found an implementation in R here: https://github.com/paobranco/Pre-processingApproachesImbalanceRegression
and the corresponding paper here: https://www.semanticscholar.org/paper/SMOTE-for-Regression-Torgo-Ribeiro/43cda672b9ac0833086e19c90d42c2c0fbc361c6
I am not opposed to it.
closing in favor of #105
Hi @glemaitre am I right that currently only BalancedRandomForestClassifier from imblearn.ensemble can take real numbers as y for regression problems? Other ensemble models such as RUSBoostClassifier cannot do this? The oversampling strategies cannot do this either?
Thanks!
Hi @glemaitre am I right that currently only BalancedRandomForestClassifier from imblearn.ensemble can take real numbers as y for regression problems? Other ensemble models such as RUSBoostClassifier cannot do this?
@bluemonk482 the names of the models you mentioned end with Classifier. That implies that they are applicable to classification tasks.
The oversampling strategies cannot do this either?
Currently no, but we are interested in including an implementation of such a method.
Thanks @chkoar !
I assume it is more complex than simply changing class BalancedRandomForestClassifier(RandomForestClassifier) to class BalancedRandomForestClassifier(RandomForestRegressor) in https://github.com/scikit-learn-contrib/imbalanced-learn/blob/c0aa81c40173bd28b863ccc1b82bbafcacb240c4/imblearn/ensemble/_forest.py ?
Yes, because you need to understand and design a proper resampling strategy in the context of regression, which is not really straightforward, and there is almost no literature on this.
-- Guillaume Lemaitre INRIA Saclay - Parietal team Center for Data Science Paris-Saclay https://glemaitre.github.io/
Understood. Thanks @glemaitre !
@glemaitre this thread is such a godsend for me! So, I understand there is presently no way to generate synthetic data for regression problems, where the output variable Y is a continuous value. Is that correct? Can the expert machine learners here suggest some way out of this sort of problem? More details are included in my post: https://stats.stackexchange.com/questions/433740/regression-on-unevenly-distributed-high-dimensional-dataset
I am reopening this issue. We could make a generic tool that quantizes the target and allows any sampler to be applied. We could think about a meta-estimator to do the job. This would require what is called a relevance function.
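A minimal sketch of the quantize-then-resample idea, under several assumptions: `resample_regression` and `RandomOverSamplerStandIn` are hypothetical names, not part of imbalanced-learn; equal-width binning is swapped in for a real relevance function; and the stand-in sampler only mimics the `fit_resample(X, y)` interface so that the example runs with NumPy alone. The continuous target is carried along as an extra column so that whatever the sampler does to the rows also applies to y.

```python
import numpy as np


class RandomOverSamplerStandIn:
    """Hypothetical stand-in that mimics the fit_resample(X, y) interface."""

    def __init__(self, random_state=0):
        self.random_state = random_state

    def fit_resample(self, X, y):
        rng = np.random.default_rng(self.random_state)
        classes, counts = np.unique(y, return_counts=True)
        target = counts.max()
        keep = [np.arange(len(y))]  # always keep the original rows
        for cls, count in zip(classes, counts):
            if count < target:
                members = np.flatnonzero(y == cls)
                # Duplicate random members until this class matches the largest one.
                keep.append(rng.choice(members, size=target - count, replace=True))
        idx = np.concatenate(keep)
        return X[idx], y[idx]


def resample_regression(X, y, sampler, n_bins=3):
    """Quantize y into equal-width bins and resample on the bin labels."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Interior edges of n_bins equal-width bins spanning the range of y.
    edges = np.linspace(y.min(), y.max(), n_bins + 1)[1:-1]
    bins = np.digitize(y, edges)  # integer labels in 0..n_bins-1
    # Append y as the last column so it is resampled together with the features.
    Xy = np.column_stack([X, y])
    Xy_res, _ = sampler.fit_resample(Xy, bins)
    return Xy_res[:, :-1], Xy_res[:, -1]


# Toy data: 95 targets near 0 and 5 rare targets near 10.
rng = np.random.default_rng(0)
y = np.concatenate([rng.normal(0.0, 0.5, 95), rng.normal(10.0, 0.5, 5)])
X = np.column_stack([y + rng.normal(0.0, 0.1, y.size), rng.normal(size=y.size)])
X_res, y_res = resample_regression(X, y, RandomOverSamplerStandIn(), n_bins=2)
```

Any object exposing `fit_resample` could presumably be passed as `sampler`, provided each bin holds enough samples for it. Whether interpolating the appended target column is appropriate for a given sampler is an open design question here.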
I believe these are relevant for this issue:
- Torgo, Luís, et al. "Smote for regression." Portuguese conference on artificial intelligence. Springer, Berlin, Heidelberg, 2013.
- Torgo, Luís, et al. "Resampling strategies for regression." Expert Systems 32.3 (2015): 465-476.
- Branco, Paula. "Re-sampling approaches for regression tasks under imbalanced domains." Unpublished Master's Thesis, Dep. Computer Science, Faculty of Sciences‐University of Porto (2014).
- Branco, Paula Oliveira, Luís Torgo, and Rita Paula Ribeiro. "SMOGN: a pre-processing approach for imbalanced regression." (2017).
https://github.com/paobranco
She wrote several papers on the topic and has some of them implemented in R.
I think the simplest way to do it without adding new methods is to discretize the target (uniformly or with k-means; quantiles won't do), then fit an oversampler, and then apply an inverse transform (assign mid-range bin values instead of bin numbers).
It should work through Pipeline and TargetTransformer.
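The three steps above could be sketched roughly as follows. This is only an illustration with NumPy: `oversample_by_target_bins` is a hypothetical helper, uniform binning is assumed, and plain random duplication stands in for whichever oversampler would actually be fitted.

```python
import numpy as np


def oversample_by_target_bins(X, y, n_bins=5, random_state=0):
    """Discretize y uniformly, oversample rare bins, map bins back to mid-range values."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(random_state)

    # 1. Discretize the target into equal-width bins.
    edges = np.linspace(y.min(), y.max(), n_bins + 1)
    labels = np.digitize(y, edges[1:-1])  # bin numbers in 0..n_bins-1
    midpoints = (edges[:-1] + edges[1:]) / 2.0

    # 2. Oversample: duplicate rows until every non-empty bin matches the largest bin.
    counts = np.bincount(labels, minlength=n_bins)
    idx = [np.arange(y.size)]
    for label in np.flatnonzero(counts):
        extra = counts.max() - counts[label]
        if extra > 0:
            members = np.flatnonzero(labels == label)
            idx.append(rng.choice(members, size=extra, replace=True))
    idx = np.concatenate(idx)

    # 3. Inverse transform: replace each bin number with the mid-range value of its bin.
    return X[idx], midpoints[labels[idx]]


# Toy data: six small targets and two rare large ones.
y = np.array([0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 9.8, 10.0])
X = np.arange(8, dtype=float).reshape(-1, 1)
X_res, y_res = oversample_by_target_bins(X, y, n_bins=2)
```

Note that the inverse transform is lossy: every returned target is a bin midpoint, so the resolution of y is limited by the number of bins.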
I also vote for SMOTER. I don't want to have to download a different package https://pypi.org/project/smogn/ to do SMOTE with regression problems.