
Pipeline performs SMOTE both over train and validation sets

Marinaobdulia opened this issue on Dec 23, 2021

I have been using the imblearn Pipeline to apply SMOTE, but I have realized that it seems to resample both the train and the validation sets. I get the same results whether I use the Pipeline or skip it and instead resample the train and validation sets myself before training my xgboost model.
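For reference, a minimal sketch of the kind of setup described above (the data and parameters are purely illustrative, not the original code):

```python
# Minimal sketch: imblearn Pipeline with SMOTE feeding an XGBoost classifier.
# With this setup, SMOTE is only applied to the data passed to `fit`;
# `predict`/`score` bypass the sampler step.
from collections import Counter

from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, stratify=y, random_state=0
)

pipe = Pipeline([
    ("smote", SMOTE(random_state=0)),
    ("clf", XGBClassifier()),
])
pipe.fit(X_train, y_train)

# The validation set is only scored, never resampled.
print("train class counts:", Counter(y_train))
print("validation accuracy:", pipe.score(X_val, y_val))
```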

I believe what is happening is that the imblearn Pipeline only gives special treatment to transformers whose method is named "fit_resample", while SMOTE's method is named fit_sample, leading the Pipeline not to pass through SMOTE and therefore to resample the validation set as well.

Any ideas about this?

Marinaobdulia avatar Dec 23 '21 09:12 Marinaobdulia

It appears to me that the SMOTE class does have fit_resample, not fit_sample. Can you provide a minimal example that shows the validation set being resampled?
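A quick check along these lines, with toy data and a plain LogisticRegression standing in for the classifier (both purely illustrative): SMOTE does expose `fit_resample`, and a fitted pipeline returns exactly one prediction per validation row, i.e. the validation data is not resampled at predict time.

```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, weights=[0.9, 0.1], random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, stratify=y, random_state=0
)

print(hasattr(SMOTE(), "fit_resample"))       # True: the method exists

pipe = Pipeline([
    ("smote", SMOTE(random_state=0)),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)

# One prediction per validation sample: no rows were added.
print(len(pipe.predict(X_val)) == len(y_val))  # True
```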

bmreiniger avatar Dec 27 '21 18:12 bmreiniger

fit_resample and fit_sample are just aliases of each other.

@Marinaobdulia could you provide a minimal example? My guess is that XGBoost does not provide a fully scikit-learn-compatible estimator (one that passes check_estimator), and thus our pipeline does not work as expected. However, an example would allow us to check why this is the case and whether we can do something in imbalanced-learn, or maybe propose a fix upstream.
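A rough way to probe the compatibility question raised here (assuming a recent scikit-learn and xgboost; the exact set of checks, and whether the call stops at the first failure, varies by version):

```python
# Run scikit-learn's estimator checks against XGBClassifier. Any failure
# surfaced here would point to an upstream fix in xgboost rather than a
# change in imbalanced-learn's Pipeline.
from sklearn.utils.estimator_checks import check_estimator
from xgboost import XGBClassifier

try:
    check_estimator(XGBClassifier())
    print("XGBClassifier passes check_estimator")
except Exception as exc:
    print(f"check_estimator failed: {exc}")
```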

glemaitre avatar Jan 11 '22 14:01 glemaitre

Closing since we don't have additional information.

glemaitre avatar Dec 03 '22 22:12 glemaitre