[SO] SMOTEENN generates an imbalanced dataset
Hi everyone,
I'm fairly new to the machine learning field, so my apologies if the question seems very simple. I'm trying to do some classification on several datasets, some of which are not well separated. I have sometimes observed that SMOTEENN can output an even more imbalanced dataset than the one it received. A small example:
import pandas as pd
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=10, n_informative=2,
                           n_redundant=0, n_repeated=0, n_classes=2,
                           n_clusters_per_class=1,
                           weights=[0.2, 0.8],
                           class_sep=0.8, random_state=0)
y = pd.Series(y)
y.value_counts()
# Result
# 1 796
# 0 204
You can already see with this dataset that SMOTEENN does not perform well:
from imblearn.combine import SMOTEENN
se = SMOTEENN(random_state=0)
X_se, y_se = se.fit_resample(X, y)
y_se.value_counts()
# Result
# 1 745
# 0 559
When the two components are run separately, we can see where the issue comes from:
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import EditedNearestNeighbours

s = SMOTE(random_state=0)
X_s, y_s = s.fit_resample(X, y)
# SMOTEENN internally defaults to EditedNearestNeighbours(sampling_strategy="all")
e = EditedNearestNeighbours(sampling_strategy="all")
X_enn, y_enn = e.fit_resample(X_s, y_s)
y_s.value_counts()
# Result
# 1    796
# 0    796

y_enn.value_counts()
# Result
# 1    745
# 0    559
So clearly, the ENN step removed far more samples from one class than from the other, since it treats both classes equally after the SMOTE step. I assume the reason is that there is less variation within the over-sampled class than within the dominant one. On some real datasets, I have even observed the complete disappearance of one of the classes. While I think I understand the reason behind this, I wonder whether this is an issue some users might not be aware of, since the behavior is completely silent in most cases when SMOTEENN is used inside a pipeline. Is this the intended behavior? Thanks!
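For reference, one way to soften this behavior would be to pass a less aggressive EditedNearestNeighbours to SMOTEENN through its enn parameter. This is only a sketch, under the assumption that the default cleaning rule (kind_sel="all", which removes a sample as soon as any of its neighbours disagrees with its label) is what drives the extra removals; the kind_sel="mode" variant removes a sample only when the majority of its neighbours disagree:

from imblearn.combine import SMOTEENN
from imblearn.under_sampling import EditedNearestNeighbours

# Less aggressive cleaning: a sample is dropped only when the majority of
# its neighbours carry a different label ("mode"), instead of when any
# single neighbour does ("all", the default used inside SMOTEENN).
soft_enn = EditedNearestNeighbours(sampling_strategy="all", kind_sel="mode")
se_soft = SMOTEENN(random_state=0, enn=soft_enn)

X_soft, y_soft = se_soft.fit_resample(X, y)
print(y_soft.value_counts())  # counts depend on the data; this reduces,
                              # but does not remove, the asymmetry

One could also restrict the cleaning to specific classes via sampling_strategy, although after SMOTE the classes are nearly balanced, so strategies like "majority" become ambiguous at that point.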