imbalanced-learn
[BUG] The estimator_ in CondensedNearestNeighbour() is incorrect for multiple classes
Describe the bug
The estimator_ object fit by CondensedNearestNeighbour() (and probably by other samplers) is incorrect when y has multiple classes (and possibly also for binary classes). In particular, the estimator is only fit on a subset containing 2 of the classes.
Steps/Code to Reproduce
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier
from imblearn.under_sampling import CondensedNearestNeighbour
n_clusters = 10
X, y = make_blobs(n_samples=2000, centers=n_clusters, n_features=2, cluster_std=.5, random_state=0)
n_neighbors = 1
condenser = CondensedNearestNeighbour(sampling_strategy='all', n_neighbors=n_neighbors)
X_cond, y_cond = condenser.fit_resample(X, y)
print('condenser.estimator_.classes_', condenser.estimator_.classes_)  # this should have 10 classes, but it only has 2
print("condenser.estimator_ accuracy", condenser.estimator_.score(X, y))
# Output:
# condenser.estimator_.classes_ [5 9]
# condenser.estimator_ accuracy 0.2
# I think the estimator we want should look like this
knn_cond_manual = KNeighborsClassifier(n_neighbors=n_neighbors).fit(X_cond, y_cond)
print('knn_cond_manual.classes_', knn_cond_manual.classes_)  # yes, 10 classes!
print("Manual KNN on condensed data accuracy", knn_cond_manual.score(X, y))  # good accuracy!
# Output:
# knn_cond_manual.classes_ [0 1 2 3 4 5 6 7 8 9]
# Manual KNN on condensed data accuracy 0.996
The issue
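The issue is that estimator_ is set on every iteration of the loop in _fit_resample (e.g. this line), so after the loop it holds whichever fit happened last, on a 2-class subset. We should really set estimator_ once, after the loop ends, by fitting it on the condensed dataset.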
This looks like it's also an issue with OneSidedSelection and possibly other samplers.
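To make the pattern concrete, here is a minimal pure-Python sketch. It is not imbalanced-learn's code: ToyOneNN and toy_condense are hypothetical names, the data is 1-D, and the condensation step is a simplified single pass. It only illustrates how an estimator that is refit inside a per-class loop ends up knowing 2 classes, and how one extra fit after the loop changes that.

```python
from collections import Counter


class ToyOneNN:
    """Hypothetical 1-nearest-neighbour classifier on 1-D points."""

    def fit(self, X, y):
        self.X_, self.y_ = list(X), list(y)
        self.classes_ = sorted(set(y))
        return self

    def predict_one(self, x):
        # Label of the closest stored point.
        nearest = min(range(len(self.X_)), key=lambda i: abs(self.X_[i] - x))
        return self.y_[nearest]


def toy_condense(X, y, refit_at_end):
    """Simplified condensation: one pass per non-minority class."""
    counts = Counter(y)
    minority = min(counts, key=counts.get)
    minority_idx = [i for i, label in enumerate(y) if label == minority]
    keep = list(minority_idx)
    estimator = None
    for target in sorted(set(y) - {minority}):
        target_idx = [i for i, label in enumerate(y) if label == target]
        seed = minority_idx + target_idx[:1]
        # The estimator is overwritten here on every iteration and only
        # ever sees the minority class plus the current target class.
        estimator = ToyOneNN().fit([X[i] for i in seed], [y[i] for i in seed])
        missed = [i for i in target_idx[1:] if estimator.predict_one(X[i]) != y[i]]
        keep += target_idx[:1] + missed
    X_cond, y_cond = [X[i] for i in keep], [y[i] for i in keep]
    if refit_at_end:
        # Sketch of the proposed fix: one final fit on the condensed data.
        estimator = ToyOneNN().fit(X_cond, y_cond)
    return X_cond, y_cond, estimator


X = [0.0, 0.1, 5.0, 5.1, 5.2, 10.0, 10.1, 10.2, 10.3]
y = [0, 0, 1, 1, 1, 2, 2, 2, 2]

_, _, buggy = toy_condense(X, y, refit_at_end=False)
_, y_cond, fixed = toy_condense(X, y, refit_at_end=True)
print(buggy.classes_)  # [0, 2]
print(fixed.classes_)  # [0, 1, 2]
```

In this toy run the estimator left over from the loop never saw class 1, which mirrors the `[5 9]` output in the reproduction.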
Fix
I think we should just refit the estimator directly before the return statement in _fit_resample, so that the method ends with the following:
X_condensed, y_condensed = _safe_indexing(X, idx_under), _safe_indexing(y, idx_under)
self.estimator_.fit(X_condensed, y_condensed)
return X_condensed, y_condensed
Versions
System:
    python: 3.8.12 (default, Oct 12 2021, 06:23:56) [Clang 10.0.0 ]
    executable: /Users/iaincarmichael/anaconda3/envs/comp_onc/bin/python
    machine: macOS-10.16-x86_64-i386-64bit

Python dependencies:
    sklearn: 1.1.1
    pip: 21.2.4
    setuptools: 58.0.4
    numpy: 1.21.4
    scipy: 1.7.3
    Cython: 0.29.25
    pandas: 1.3.5
    matplotlib: 3.5.0
    joblib: 1.1.0
    threadpoolctl: 2.2.0

Built with OpenMP: True

threadpoolctl info:
    filepath: /Users/iaincarmichael/anaconda3/envs/comp_onc/lib/python3.8/site-packages/sklearn/.dylibs/libomp.dylib
    prefix: libomp
    user_api: openmp
    internal_api: openmp
    version: None
    num_threads: 8

    filepath: /Users/iaincarmichael/anaconda3/envs/comp_onc/lib/python3.8/site-packages/numpy/.dylibs/libopenblas.0.dylib
    prefix: libopenblas
    user_api: blas
    internal_api: openblas
    version: 0.3.17
    num_threads: 4
    threading_layer: pthreads
    architecture: Haswell

    filepath: /Users/iaincarmichael/anaconda3/envs/comp_onc/lib/libmkl_rt.1.dylib
    prefix: libmkl_rt
    user_api: blas
    internal_api: mkl
    version: 2021.4-Product
    num_threads: 4
    threading_layer: intel

    filepath: /Users/iaincarmichael/anaconda3/envs/comp_onc/lib/libomp.dylib
    prefix: libomp
    user_api: openmp
    internal_api: openmp
    version: None
    num_threads: 8