[SO] SMOTEENN generates an imbalanced dataset
Hi everyone,
I'm fairly new to the machine learning field, so my apologies if the question seems very simple. I'm trying to do some classification on several datasets, some of which are not well separated. I have sometimes observed that SMOTEENN can output an even more imbalanced dataset than the one it received. A small example:
import pandas as pd
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=10, n_informative=2,
                           n_redundant=0, n_repeated=0, n_classes=2,
                           n_clusters_per_class=1,
                           weights=[0.2, 0.8],
                           class_sep=0.8, random_state=0)
y = pd.Series(y)
y.value_counts()
# Result
# 1 796
# 0 204
You can already see with this dataset that SMOTEENN does not perform well:
from imblearn.combine import SMOTEENN
se = SMOTEENN(random_state=0)
X_se, y_se = se.fit_resample(X, y)
y_se.value_counts()
# Result
# 1 745
# 0 559
When the two components are run separately, we can see where the issue comes from:
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import EditedNearestNeighbours

s = SMOTE(random_state=0)
X_s, y_s = s.fit_resample(X, y)
# SMOTEENN internally defaults to EditedNearestNeighbours(sampling_strategy="all")
e = EditedNearestNeighbours(sampling_strategy="all")
X_enn, y_enn = e.fit_resample(X_s, y_s)
y_s.value_counts()
# Result
# 1    796
# 0    796

y_enn.value_counts()
# Result
# 1    745
# 0    559
So clearly, the ENN step removed far more samples from one class than from the other, since it treats both classes equally after the SMOTE step. I assume the reason is that there is less variation within the over-sampled class than within the dominant one. On some real datasets, I have even observed the complete disappearance of one of the classes. While I think I understand the reason behind this, I wonder whether this is an issue some users might not be aware of, since the behavior is completely silent in most cases when SMOTEENN is used inside a pipeline. Is this the intended behavior? Thanks!
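For reference, one way to soften this behavior would be to pass a less aggressive EditedNearestNeighbours to SMOTEENN through its enn parameter. This is only a sketch, under the assumption that the default cleaning rule (kind_sel="all", which removes a sample as soon as any of its neighbours disagrees with its label) is what drives the extra removals; the kind_sel="mode" variant removes a sample only when the majority of its neighbours disagree:

from imblearn.combine import SMOTEENN
from imblearn.under_sampling import EditedNearestNeighbours

# Less aggressive cleaning: a sample is dropped only when the majority of
# its neighbours carry a different label ("mode"), instead of when any
# single neighbour does ("all", the default used inside SMOTEENN).
soft_enn = EditedNearestNeighbours(sampling_strategy="all", kind_sel="mode")
se_soft = SMOTEENN(random_state=0, enn=soft_enn)

X_soft, y_soft = se_soft.fit_resample(X, y)
print(y_soft.value_counts())  # counts depend on the data; this reduces,
                              # but does not remove, the asymmetry

One could also restrict the cleaning to specific classes via sampling_strategy, although after SMOTE the classes are nearly balanced, so strategies like "majority" become ambiguous at that point.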