
[WIP] ENH: Class Sensitive Scaling

Open BernhardSchlegel opened this issue 7 years ago • 17 comments

What does this implement/fix? Explain your changes.

This implements a new technique called "class sensitive scaling" (CSS) that removes borderline noise by scaling samples toward their corresponding class center. It computes extremely fast, eases the identification of a decision boundary, and is an alternative to Tomek-link-based concepts such as CNN or OSS, which reveal the decision boundary by removing noisy borderline samples. For details, please refer to:

B. Schlegel, and B. Sick. "Dealing with class imbalance the scalable way: Evaluation of various techniques based on classification grade and computational complexity." 2017 IEEE International Conference on Data Mining Workshops, 2017.
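For a quick intuition before reading the paper, the linear mode can be sketched roughly as follows. This is a hypothetical illustration based on the description above (the function name, signature, and the meaning of `c` are my assumptions), not the PR's actual implementation:

```python
import numpy as np

def class_sensitive_scale(X, y, c=0.1):
    """Move every sample a fraction ``c`` of the way toward its class center.

    Hypothetical sketch of the linear mode described above; the PR's
    actual CSS implementation may differ.
    """
    X_scaled = X.astype(float)  # astype copies, so X is left untouched
    for label in np.unique(y):
        mask = y == label
        center = X[mask].mean(axis=0)             # per-class centroid
        X_scaled[mask] += c * (center - X[mask])  # shrink toward the centroid
    return X_scaled
```

With `c=0` the data is unchanged; with `c=1` every sample collapses onto its class centroid, so intermediate values trade borderline noise against within-class variance.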

Any other comments?

This is my first pull request!

If you want a simple visualization, you can use the following snippet:

import numpy as np
from sklearn.utils import shuffle

# create an imbalanced two-class sample dataset
rng = np.random.RandomState(42)
n_samples_1 = 50
n_samples_2 = 5
X_syn = np.r_[1.5 * rng.randn(n_samples_1, 2),
              0.5 * rng.randn(n_samples_2, 2) + [2, 2]]
y_syn = np.array([0] * n_samples_1 + [1] * n_samples_2)
X_syn, y_syn = shuffle(X_syn, y_syn, random_state=rng)
idx_class_0_orig = y_syn == 0

# apply CSS
from imblearn.scaling import CSS
css = CSS(sampling_strategy="both", mode="linear", c=0.1, shuffle=True)
X_train_res, y_train_res = css.fit_sample(X_syn, y_syn)

# plot original dataset
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(14, 6))
ax = fig.add_subplot(1, 2, 1)
plt.scatter(X_syn[idx_class_0_orig, 0], X_syn[idx_class_0_orig, 1],
            alpha=.8, label='Class #0')
plt.scatter(X_syn[~idx_class_0_orig, 0], X_syn[~idx_class_0_orig, 1],
            alpha=.8, label='Class #1')
ax.set_xlim([-5, 5])
ax.set_ylim([-5, 5])
plt.yticks(range(-5, 6))
plt.xticks(range(-5, 6))

plt.title('Original dataset')
plt.legend()

# plot CSS dataset
idx_class_0 = y_train_res == 0
ax = fig.add_subplot(1, 2, 2)
plt.scatter(X_train_res[idx_class_0, 0], X_train_res[idx_class_0, 1],
            alpha=.8, label='Class #0')
plt.scatter(X_train_res[~idx_class_0, 0], X_train_res[~idx_class_0, 1],
            alpha=.8, label='Class #1')
ax.set_xlim([-5, 5])
ax.set_ylim([-5, 5])
plt.yticks(range(-5, 6))
plt.xticks(range(-5, 6))
plt.title('Scaling using CSS')
plt.legend()
plt.show()

Thanks for the great library, by the way!

edit: updated sample code to match code changes.

BernhardSchlegel avatar Mar 27 '18 20:03 BernhardSchlegel

Looking forward to it. I'll try to check it ASAP. In the meantime, could you check why the tests are not passing?

glemaitre avatar Mar 28 '18 05:03 glemaitre

Hey, thanks for the lightning fast response!

Regarding Travis:

  1. I had a ", newline" in my code example, which is not allowed. Fixed that.
  2. Then there is a TypeError: mean() got an unexpected keyword argument 'dtype' error, but I don't call mean() anywhere in my code with anything more than the array a and the axis argument.
  3. assert_allclose(X_res, X_res_ova) from check_samplers_multiclass_ova fails. But I think this is by design, since CSS is supposed to scale (= move points toward the corresponding class center). How do we proceed here? This error comes up twice.

Regarding AppVeyor, in addition:

  1. NotImplementedError: subtracting a nonzero scalar from a sparse matrix is not supported: I'll look into that; the solution is not straightforward. Possible solutions are ignoring sparse matrices with a warning ("CSS only supports dense feature matrices"), implementing some sort of imputation, or solving it algorithmically to support sparse matrices.

@glemaitre how do you think we should tackle 3. (the error by design)?

thank you so much for your efforts!

BernhardSchlegel avatar Mar 28 '18 07:03 BernhardSchlegel

assert_allclose(X_res, X_res_ova) from check_samplers_multiclass_ova fails. But I think this is by design, since CSS is supposed to scale (=move points to the corresponding class center). How do we proceed here? This error comes up twice

We can skip the test if it does not apply. I have to check a bit more what the test is doing.

NotImplementedError: subtracting a nonzero scalar from a sparse matrix is not supported: I'll look into that, solution is not straightforward. Ignoring sparse matrices with a warning ("CSS only supports dense feature matrices"), implementing some sort of imputation or solving it algorithmically to support sparse matrices are possible solutions.

It is strange that this is passing in Travis. I don't think that your code can handle sparse matrices at the moment (you need to use safe_indexing). Even if the matrix is not sparse, you should take care of the format; it is similar to the ClusterCentroids sampler.

Actually, this failure shows up in Travis as well.

Then there is a TypeError: mean() got an unexpected keyword argument 'dtype' error, but I don't call mean() anywhere in my code with more than the array a and the axis set.

It is happening with sparse matrices :) I think that we should refer to https://github.com/scikit-learn/scikit-learn/blob/a24c8b46/sklearn/utils/sparsefuncs.py#L65 to compute the mean vectors.
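As a sketch, a class-mean helper along those lines could branch on sparsity and delegate to the sparse-aware routine (the helper name is hypothetical; `mean_variance_axis` is the public function in `sklearn.utils.sparsefuncs`):

```python
import numpy as np
from scipy import sparse
from sklearn.utils.sparsefuncs import mean_variance_axis

def class_mean(X, mask):
    """Column means of the rows selected by ``mask``, dense or sparse X.

    Hypothetical helper: it avoids calling ``.mean(..., dtype=...)`` on a
    sparse matrix, which is what raises the TypeError seen in the CI logs.
    """
    X_sub = X[mask]
    if sparse.issparse(X_sub):
        # sparse-aware mean; works on CSR/CSC without densifying
        means, _ = mean_variance_axis(X_sub, axis=0)
        return means
    return np.asarray(X_sub).mean(axis=0)
```

The dense branch is unchanged behavior; only the sparse branch routes through scikit-learn's helper.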

glemaitre avatar Mar 28 '18 07:03 glemaitre

Thanks for the detailed feedback! I'll try to fix everything you pointed out and report back as soon as I'm done or have questions.

Please keep me updated regarding the "we can skip that test" / error by design issue.

Thanks a lot !

BernhardSchlegel avatar Mar 28 '18 07:03 BernhardSchlegel

We are also missing some documentation. We need to add an example illustrating how to use the class and how it works. On the other hand, we need to add a section to the user guide: we have a new type of "sampler" which is really more of a scaler, so we would need a new section.

glemaitre avatar Mar 28 '18 16:03 glemaitre

Just a quick comment: I would use the name ClassSensitiveScaling instead of CSS.

chkoar avatar Mar 28 '18 16:03 chkoar

@chkoar Since this method does not over- or under-sample but only scales, I have the impression that it should inherit from TransformerMixin and use fit_transform, since y is not modified.

We could still use the sampling_strategy utils inside the class.

What are your thoughts on that?

glemaitre avatar Mar 28 '18 16:03 glemaitre

Since this method does not over- or under-sample but only scales, I have the impression that it should inherit from TransformerMixin and use fit_transform, since y is not modified.

@glemaitre I agree. We could place that class in the preprocessing module/package to stay in line with scikit-learn.

chkoar avatar Mar 29 '18 11:03 chkoar

May I ask if there is anything I could help with?

BernhardSchlegel avatar Apr 22 '18 17:04 BernhardSchlegel

Sorry, I forgot to mention. Since we do not actually sample here but scale, we think it would be better to derive from TransformerMixin instead of SamplerMixin. We still have to think about how we should incorporate it with the rest of the samplers.

Could you try to migrate to a full transformer?

glemaitre avatar Apr 22 '18 20:04 glemaitre

Sorry for the late reply. You're saying:

  1. I should create a class TransformerMixin in base.py, where SamplerMixin resides.
  2. I should create a class BaseScaler in base.py, where BaseSampler resides.
  3. I should remove the class BaseScaler from base.py in the /scaling directory.

thanks in advance!

BernhardSchlegel avatar Aug 06 '18 09:08 BernhardSchlegel

@BernhardSchlegel actually, you should inherit from sklearn.base.TransformerMixin. Since your method does not resample, I think that we should use the transform and fit_transform API. I believe that the method should be placed somewhere under the imblearn.preprocessing package.
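To make the suggestion concrete, a rough skeleton could look like the following. The class name, the parameter, and the decision to override fit_transform (because the scaling needs y, which TransformerMixin's default fit_transform does not forward to transform) are all assumptions for illustration, not the agreed design:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils import check_X_y, check_array

class ClassSensitiveScaler(BaseEstimator, TransformerMixin):
    """Hypothetical sketch of CSS as a transformer (name and API assumed)."""

    def __init__(self, c=0.1):
        self.c = c

    def fit(self, X, y):
        X, y = check_X_y(X, y)
        # learn one centroid per class from the training data
        self.classes_ = np.unique(y)
        self.centers_ = {k: X[y == k].mean(axis=0) for k in self.classes_}
        return self

    def fit_transform(self, X, y):
        # the scaling needs the labels, so override fit_transform instead
        # of relying on TransformerMixin's fit(X, y).transform(X)
        self.fit(X, y)
        X = check_array(X).astype(float)  # astype copies the input
        y = np.asarray(y)
        for k in self.classes_:
            mask = y == k
            X[mask] += self.c * (self.centers_[k] - X[mask])
        return X
```

A plain transform(X) without y is deliberately left out of the sketch, since it is exactly the open design question in this thread.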

chkoar avatar Mar 02 '19 16:03 chkoar

@BernhardSchlegel are you willing to finish this PR? It would be a nice addition to imbalanced-learn package.

chkoar avatar Nov 24 '19 17:11 chkoar

Yeah sure, just tell me what to do :) Last time I tried, I wasted my time.

BernhardSchlegel avatar Nov 24 '19 17:11 BernhardSchlegel

Yeah sure, just tell me what to do :) Last time I tried, I wasted my time.

This was the last proposal which is still standing:

https://github.com/scikit-learn-contrib/imbalanced-learn/pull/416#issuecomment-468935416

It will be a better candidate for a transformer than a sampler.

glemaitre avatar Nov 24 '19 18:11 glemaitre

Then we will need to review the internal changes that could have occurred since your last push. Looking at the PR, we also need user guide documentation to show what to expect from the method and when to use it, and to add it to the API documentation.

glemaitre avatar Nov 24 '19 18:11 glemaitre

@glemaitre can we close this PR?

chkoar avatar Jul 29 '20 09:07 chkoar