
Question: Generation of synthetic samples with SMOTE

lringel opened this issue 2 years ago · 1 comment

Hi,

I have a question regarding the generation of synthetic samples via SMOTE. The comments in the source code state that a new sample is generated in the following manner:

s_{s} = s_{i} + u(0, 1) * (s_{i} - s_{nn})
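As a minimal sketch of what "the same random number for each attribute" means (the arrays below are made-up illustration values, not imbalanced-learn internals, and the difference is written toward the neighbour, s_nn - s_i, so the synthetic point falls between the two samples):

```python
import numpy as np

rng = np.random.default_rng(0)

s_i = np.array([1.0, 2.0, 3.0])   # a minority sample (illustrative values)
s_nn = np.array([4.0, 0.0, 1.0])  # its nearest neighbour (illustrative values)

# A single scalar u drawn from U(0, 1) and shared by all attributes
# keeps the synthetic sample on the segment between s_i and s_nn.
u = rng.uniform(0, 1)
s_s = s_i + u * (s_nn - s_i)

# The per-feature ratio (s_s - s_i) / (s_nn - s_i) is then the same
# scalar u for every attribute.
ratios = (s_s - s_i) / (s_nn - s_i)
print(ratios)
```

If an independent random number were drawn per attribute instead, these ratios would generally differ from one another.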

After testing it myself, I came to the conclusion that the current implementation uses the same random number for each attribute. The code I used for testing:

from sklearn.datasets import make_classification
import pandas as pd
from imblearn.over_sampling import SMOTE

# Tiny two-class data set with three features.
X, y = make_classification(n_classes=2, class_sep=2,
                           weights=[0.4, 0.6], n_informative=3, n_redundant=0,
                           flip_y=0, n_features=3, n_clusters_per_class=1,
                           n_samples=5, random_state=42)

columns = ['feature_1', 'feature_2', 'feature_3', 'label']
df = pd.concat([pd.DataFrame(X), pd.DataFrame(y)], axis=1, join="inner")
df.columns = columns
print(df)

for _ in range(100):
    sm = SMOTE(k_neighbors=1)
    X_res, y_res = sm.fit_resample(X, y)

    df = pd.concat([pd.DataFrame(X_res), pd.DataFrame(y_res)],
                   axis=1, join="inner")
    df.columns = columns

    # Per-feature distance between the sample at row 4 and its
    # neighbour at row 1.
    dis_1 = df['feature_1'][4] - df['feature_1'][1]
    dis_2 = df['feature_2'][4] - df['feature_2'][1]
    dis_3 = df['feature_3'][4] - df['feature_3'][1]

    # Per-feature distance between the sample at row 4 and the
    # synthetic sample appended at row 5.
    syn_dis_1 = df['feature_1'][4] - df['feature_1'][5]
    syn_dis_2 = df['feature_2'][4] - df['feature_2'][5]
    syn_dis_3 = df['feature_3'][4] - df['feature_3'][5]

    # If a single random number is shared across attributes, these
    # three ratios are identical within every iteration.
    print(syn_dis_1 / dis_1, syn_dis_2 / dis_2, syn_dis_3 / dis_3)

If there aren't any mistakes in my example, I think the implementation contradicts the example shown in the SMOTE anniversary paper (page 6 of the pdf / page 868 of the paper).

Can anyone clarify why the implementation uses the same random number for every attribute instead of a different random number per attribute?

Thanks in advance!

lringel · Jun 19 '22 16:06

So it boils down to whether we generate the data point on the segment or in the hyper-rectangle.

I recall contacting the authors because the original paper advocates for the first solution, while the subsequent papers (as well as the anniversary paper) advocate for the second solution.

The answer was that both approaches have been tried and did not lead to any difference. Actually, there is no theoretical argument in favour of the hyper-rectangle approach. In some way, generating on the segment is probably more mathematically sound, because it amounts to a kind of interpolation on the manifold formed by the neighbours. The hyper-rectangle does not make sense in that view.
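The two options can be sketched side by side (a hypothetical illustration with made-up 2-D points, not code from imbalanced-learn):

```python
import numpy as np

rng = np.random.default_rng(42)

s_i = np.array([0.0, 0.0])   # a minority sample (illustrative values)
s_nn = np.array([1.0, 2.0])  # its nearest neighbour (illustrative values)

# Segment: one scalar u shared across attributes, so the new point
# lies on the line between s_i and s_nn.
u = rng.uniform(0, 1)
on_segment = s_i + u * (s_nn - s_i)

# Hyper-rectangle: an independent u per attribute, so the new point
# can fall anywhere in the axis-aligned box spanned by s_i and s_nn.
u_vec = rng.uniform(0, 1, size=s_i.shape)
in_box = s_i + u_vec * (s_nn - s_i)

# On the segment the per-feature ratios agree; in the box they
# generally do not.
print((on_segment - s_i) / (s_nn - s_i))
print((in_box - s_i) / (s_nn - s_i))
```

With many neighbours, each segment stays inside the convex hull of the minority class, which is the interpolation argument made above; the box can reach outside that hull.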

glemaitre · Jul 11 '22 15:07

Thank you for taking the time to answer my question in detail!

lringel · Aug 29 '22 19:08