imbalanced-learn

[BUG] The estimator_ in CondensedNearestNeighbour() is incorrect for multiple classes

Open idc9 opened this issue 2 years ago • 0 comments

Describe the bug

The estimator_ object fit by CondensedNearestNeighbour() (and probably other sampling strategies) is incorrect when y has multiple classes (and possibly also for binary classes). In particular, the estimator ends up fit on only 2 of the classes.

Steps/Code to Reproduce

from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier
from imblearn.under_sampling import CondensedNearestNeighbour

n_clusters = 10
X, y = make_blobs(n_samples=2000, centers=n_clusters, n_features=2, cluster_std=.5, random_state=0)

n_neighbors = 1
condenser = CondensedNearestNeighbour(sampling_strategy='all', n_neighbors=n_neighbors)
X_cond, y_cond = condenser.fit_resample(X, y)
print('condenser.estimator_.classes_', condenser.estimator_.classes_)  # this should have 10 classes, but it only has 2
print("condenser.estimator_ accuracy", condenser.estimator_.score(X, y))
# Output:
# condenser.estimator_.classes_ [5 9]
# condenser.estimator_ accuracy 0.2

# I think the estimator we want should look like this
knn_cond_manual = KNeighborsClassifier(n_neighbors=n_neighbors).fit(X_cond, y_cond)
print('knn_cond_manual.classes_', knn_cond_manual.classes_)  # yes 10 classes!
print("Manual KNN on condensed data accuracy", knn_cond_manual.score(X, y))  # good accuracy!
# Output:
# knn_cond_manual.classes_ [0 1 2 3 4 5 6 7 8 9]
# Manual KNN on condensed data accuracy 0.996

The issue

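The issue is that we set estimator_ in each run of the loop in _fit_resample (e.g. this line), so once the loop finishes, estimator_ only reflects the data from the last iteration. We should really set estimator_ after the loop ends, fitting it on the condensed dataset.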

This looks like it's also an issue with OneSidedSelection and possibly other samplers.
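To make the pattern concrete, here is a minimal pure-Python sketch of it. It is not imbalanced-learn's actual code: `ToyEstimator`, `toy_resample`, and the "keep every sample" selection rule are hypothetical stand-ins, and the only thing the sketch tries to show is how refitting inside a per-class loop leaves the estimator with the classes of the final pass.

```python
class ToyEstimator:
    """Hypothetical stand-in for a classifier: it only records the labels it saw."""

    def fit(self, X, y):
        self.classes_ = sorted(set(y))
        return self


def toy_resample(X, y, minority, refit_after_loop):
    """Simplified per-class loop (an assumption, not the library's implementation)."""
    estimator = ToyEstimator()
    kept = [i for i, label in enumerate(y) if label == minority]
    for target in sorted(set(y) - {minority}):
        # Each pass sees only the minority class plus the current target class.
        subset = [i for i, label in enumerate(y) if label in (minority, target)]
        estimator.fit([X[i] for i in subset], [y[i] for i in subset])
        # Toy selection rule: keep every sample of the target class.
        kept.extend(i for i, label in enumerate(y) if label == target)
    kept.sort()
    X_res = [X[i] for i in kept]
    y_res = [y[i] for i in kept]
    if refit_after_loop:
        # Proposed behaviour: one final fit on the full resampled data.
        estimator.fit(X_res, y_res)
    return X_res, y_res, estimator


X = [[0.0], [0.1], [1.0], [1.1], [2.0], [2.1], [2.2]]
y = [0, 0, 1, 1, 2, 2, 2]

_, _, buggy = toy_resample(X, y, minority=0, refit_after_loop=False)
_, _, fixed = toy_resample(X, y, minority=0, refit_after_loop=True)
print(buggy.classes_)  # [0, 2]
print(fixed.classes_)  # [0, 1, 2]
```

Without the final fit, the toy estimator knows only the minority class and the last target class, which matches the two-class `classes_` seen in the reproduction.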

Fix

I think we should just add the following directly before the return statement in _fit_resample, so that the method ends with:

X_condensed, y_condensed = _safe_indexing(X, idx_under), _safe_indexing(y, idx_under)
self.estimator_.fit(X_condensed, y_condensed)
return X_condensed, y_condensed
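A fix like this could be guarded by a small check that the fitted estimator knows every label in the resampled target. The helper below is only a sketch: `missing_classes` is a hypothetical name, it relies on nothing beyond a `classes_` attribute, and the stub object stands in for a fitted sampler's estimator_ so the example runs without scikit-learn installed.

```python
from types import SimpleNamespace


def missing_classes(estimator, y_resampled):
    """Return labels present in y_resampled but absent from estimator.classes_."""
    return sorted(set(y_resampled) - set(estimator.classes_))


# Stub mimicking the estimator_ reported above, which only knew classes 5 and 9.
stub_estimator = SimpleNamespace(classes_=[5, 9])
y_stub = list(range(10))

print(missing_classes(stub_estimator, y_stub))  # [0, 1, 2, 3, 4, 6, 7, 8]
```

With the reproduction script, the equivalent call would be `missing_classes(condenser.estimator_, y_cond)`, and an empty list would indicate the estimator was fit on all classes of the condensed data.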

Versions


System:
    python: 3.8.12 (default, Oct 12 2021, 06:23:56)  [Clang 10.0.0 ]
executable: /Users/iaincarmichael/anaconda3/envs/comp_onc/bin/python
   machine: macOS-10.16-x86_64-i386-64bit

Python dependencies:
      sklearn: 1.1.1
          pip: 21.2.4
   setuptools: 58.0.4
        numpy: 1.21.4
        scipy: 1.7.3
       Cython: 0.29.25
       pandas: 1.3.5
   matplotlib: 3.5.0
       joblib: 1.1.0
threadpoolctl: 2.2.0

Built with OpenMP: True

threadpoolctl info:
       filepath: /Users/iaincarmichael/anaconda3/envs/comp_onc/lib/python3.8/site-packages/sklearn/.dylibs/libomp.dylib
         prefix: libomp
       user_api: openmp
   internal_api: openmp
        version: None
    num_threads: 8

       filepath: /Users/iaincarmichael/anaconda3/envs/comp_onc/lib/python3.8/site-packages/numpy/.dylibs/libopenblas.0.dylib
         prefix: libopenblas
       user_api: blas
   internal_api: openblas
        version: 0.3.17
    num_threads: 4
threading_layer: pthreads
   architecture: Haswell

       filepath: /Users/iaincarmichael/anaconda3/envs/comp_onc/lib/libmkl_rt.1.dylib
         prefix: libmkl_rt
       user_api: blas
   internal_api: mkl
        version: 2021.4-Product
    num_threads: 4
threading_layer: intel

       filepath: /Users/iaincarmichael/anaconda3/envs/comp_onc/lib/libomp.dylib
         prefix: libomp
       user_api: openmp
   internal_api: openmp
        version: None
    num_threads: 8

idc9, Jun 09 '22 21:06