
When using SOMO, why did the two classes not become balanced, and why did the sample counts not change?

Open leaphan opened this issue 3 years ago • 2 comments

leaphan, Apr 23 '21 01:04

There can be multiple reasons for that. In many cases the authors of a particular SMOTE variant did not cover all the possible corner cases, for example,

  1. all minority samples are treated as noise according to the noise definition of the technique,
  2. the method wants to work with, say, 5 nearest neighbors, but there are only 3 minority samples,
  3. mathematical techniques, like self-organizing maps, do not converge,
  4. etc.,

all of these arise because the nature of the data is not compatible with the parameter settings and assumptions of the SMOTE variant.
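The second corner case can be checked before oversampling. The following is only a minimal sketch: `enough_minority_samples` is a hypothetical helper (it is not part of smote_variants), and it assumes the technique needs `n_neighbors` other minority samples for each minority sample.

```python
from collections import Counter


def enough_minority_samples(y, n_neighbors=5):
    """Return True if every minority sample can have n_neighbors
    distinct minority neighbors (hypothetical helper, not part of
    smote_variants)."""
    counts = Counter(y)
    minority_count = min(counts.values())
    # A sample is not its own neighbor, so n_neighbors + 1 samples are needed.
    return minority_count >= n_neighbors + 1


# 3 minority samples cannot supply 5 nearest neighbors each.
labels = [0] * 20 + [1] * 3
print(enough_minority_samples(labels, n_neighbors=5))  # False
print(enough_minority_samples(labels, n_neighbors=2))  # True
```

If the check fails, lowering the neighbor count (where the chosen oversampler exposes such a parameter) is one thing to try.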

Where I found reasonable resolutions, I implemented them. Where a resolution is infeasible (for example, determining the 5 closest neighbors when a class has only 3 samples), the data is returned unaltered, although I would expect some message in the logs if logging is enabled.
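To see such messages, logging has to be switched on before the oversampler is called. A minimal sketch using Python's standard `logging` module follows; the logger name `"smote_variants"` is an assumption here, so if nothing shows up, the root-level configuration alone should still surface INFO messages from any library.

```python
import logging

# Show INFO-level messages from all libraries on the console.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(name)s %(levelname)s: %(message)s",
)

# Assumption: the package logs through a logger named "smote_variants".
sv_logger = logging.getLogger("smote_variants")
sv_logger.setLevel(logging.INFO)
```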

Most likely your data is a corner case of the SOMO implementation with the parameters you used. Adjusting the parameters might lead to a properly operating SOMO.
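One way to search for a working setting is to try several parameter combinations and keep the first one that actually changes the class counts. This is only a sketch: `class_counts_changed` and `first_working_setting` are hypothetical helpers, and `make_sampler` is a caller-supplied factory; the only thing assumed about the oversampler is the `sample(X, y)` method.

```python
from collections import Counter


def class_counts_changed(y_before, y_after):
    """Return True if oversampling changed the per-class sample counts."""
    return Counter(y_before) != Counter(y_after)


def first_working_setting(make_sampler, settings, X, y):
    """Try each parameter setting until one actually adds samples.

    make_sampler is a caller-supplied factory (hypothetical), for
    example ``lambda **kw: sv.SOMO(**kw)``; the returned object only
    needs a ``sample(X, y)`` method.
    """
    for params in settings:
        _, y_new = make_sampler(**params).sample(X, y)
        if class_counts_changed(y, y_new):
            return params
    return None  # every setting returned the data unaltered
```

Here `settings` is a list of keyword-argument dictionaries. The keys must be valid constructor arguments of the chosen oversampler, so check its docstring before building the list.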

Also, if you share a minimal working example, I can look into it.

gykovacs, Apr 23 '21 17:04

Thanks for your reply. I wrote code like this:

    pip install -U imbalanced-learn
    pip install smote-variants

    import numpy as np
    import smote_variants as sv
    #import imblearn.datasets as imbd
    from imblearn.datasets import fetch_datasets

    datasets = fetch_datasets(filter_data=['oil'])
    X, y = datasets['oil']['data'], datasets['oil']['target']
    for label, count in zip(*np.unique(y, return_counts=True)):
        print('Class {} has {} instances'.format(label, count))

    oversampler = sv.SOMO()
    X_samp, y_samp = oversampler.sample(X, y)

    for label, count in zip(*np.unique(y_samp, return_counts=True)):
        print('Class {} has {} instances after oversampling'.format(label, count))
    print(X_samp, y_samp)

and the printed result:

    Class -1 has 896 instances
    Class 1 has 41 instances
    Class -1 has 896 instances after oversampling
    Class 1 has 41 instances after oversampling

After oversampling, there is no change in the number of samples in either class.

leaphan, Apr 25 '21 02:04