[BUG] ValueError: Found array with 0 sample(s)

Open allenyllee opened this issue 5 years ago • 7 comments

Describe the bug

When SVMSMOTE is used on a dataset whose minority class has very few samples (possibly fewer than 10), it raises ValueError: Found array with 0 sample(s) (shape=(0, 600)) while a minimum of 1 is required.

Steps/Code to Reproduce

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SVMSMOTE

X, y = make_classification(n_classes=3, class_sep=0,
                           weights=[0.004, 0.451, 0.545], n_informative=3,
                           n_redundant=0, flip_y=0, n_features=3,
                           n_clusters_per_class=2, n_samples=1000,
                           random_state=10)
print('Original dataset shape %s' % Counter(y))

sm = SVMSMOTE(random_state=42, k_neighbors=4)
X_res, y_res = sm.fit_resample(X, y)
print('Resampled dataset shape %s' % Counter(y_res))

Expected Results

Running without error

Actual Results

Original dataset shape Counter({2: 544, 1: 451, 0: 5})

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-78-8f5d2308c2bd> in <module>()
     10 
     11 sm = SVMSMOTE(random_state=42, k_neighbors=4)
---> 12 X_res, y_res = sm.fit_resample(X, y)
     13 print('Resampled dataset shape %s' % Counter(y_res))

~/anaconda3/lib/python3.6/site-packages/imblearn/base.py in fit_resample(self, X, y)
     82             self.sampling_strategy, y, self._sampling_type)
     83 
---> 84         output = self._fit_resample(X, y)
     85 
     86         if binarize_y:

~/anaconda3/lib/python3.6/site-packages/imblearn/over_sampling/_smote.py in _fit_resample(self, X, y)
    530     def _fit_resample(self, X, y):
    531         # print("_fit_resample X shape", X.shape)
--> 532         return self._sample(X, y)
    533 
    534     def _sample(self, X, y):

~/anaconda3/lib/python3.6/site-packages/imblearn/over_sampling/_smote.py in _sample(self, X, y)
    569 
    570             danger_bool = self._in_danger_noise(
--> 571                 self.nn_m_, support_vector, class_sample, y, kind='danger')
    572             safety_bool = np.logical_not(danger_bool)
    573 

~/anaconda3/lib/python3.6/site-packages/imblearn/over_sampling/_smote.py in _in_danger_noise(self, nn_estimator, samples, target_class, y, kind)
    213         # print("kind", kind)
    214         # print("_in_danger_noise samples shape", samples.shape)
--> 215         x = nn_estimator.kneighbors(samples, return_distance=False)[:, 1:]
    216         # print("x", x)
    217         nn_label = (y[x] != target_class).astype(int)

~/anaconda3/lib/python3.6/site-packages/sklearn/neighbors/base.py in kneighbors(self, X, n_neighbors, return_distance)
    400         if X is not None:
    401             query_is_train = False
--> 402             X = check_array(X, accept_sparse='csr')
    403         else:
    404             query_is_train = True

~/anaconda3/lib/python3.6/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    548                              " minimum of %d is required%s."
    549                              % (n_samples, array.shape, ensure_min_samples,
--> 550                                 context))
    551 
    552     if ensure_min_features > 0 and array.ndim == 2:

ValueError: Found array with 0 sample(s) (shape=(0, 3)) while a minimum of 1 is required.

Versions

System:
    python: 3.6.9 |Anaconda, Inc.| (default, Jul 30 2019, 19:07:31) [GCC 7.3.0]
    executable: /home/allenyl/anaconda3/bin/python
    machine: Linux-4.15.0-112-generic-x86_64-with-debian-buster-sid

Python deps:
    pip: 19.2.2
    setuptools: 41.0.1
    sklearn: 0.21.3
    numpy: 1.15.1
    scipy: 1.4.1
    Cython: 0.28.2
    pandas: 0.24.1

allenyllee avatar Aug 11 '20 08:08 allenyllee

Did you find a fix for this? I'm having the same issue here.

hiyamgh avatar Jan 31 '21 19:01 hiyamgh

@hiyamgh I've pushed a fix, but as @glemaitre commented on #743, I need to add something before it can be merged. I currently have no time to do it, though.

allenyllee avatar Feb 06 '21 12:02 allenyllee

Thank you @allenyllee for notifying me. On my side, the error turned out to be caused by using SMOTENC and passing an empty list for the categorical_features parameter (I did not know that the dataset must contain a mix of numerical and categorical features).

Here is the documentation
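
For readers hitting the SMOTENC variant of this error, a small guard can pick the sampler from the feature mix before resampling. This is only a sketch: `choose_smote_variant` and `build_sampler` are hypothetical helper names, and it assumes `categorical_features` is a list of column indices (not a boolean mask) and that your imbalanced-learn version provides `SMOTEN` for purely categorical data.

```python
def choose_smote_variant(categorical_features, n_features):
    """Return the name of the imblearn over-sampler that fits the feature mix.

    Hypothetical helper; assumes `categorical_features` holds column indices.
    """
    n_categorical = len(set(categorical_features))
    if n_categorical == 0:
        return "SMOTE"    # purely numerical data
    if n_categorical == n_features:
        return "SMOTEN"   # purely categorical data
    return "SMOTENC"      # mix of numerical and categorical features


def build_sampler(categorical_features, n_features, **kwargs):
    """Instantiate the sampler chosen above (requires imbalanced-learn)."""
    # Imported lazily so choose_smote_variant works without imbalanced-learn.
    from imblearn import over_sampling

    name = choose_smote_variant(categorical_features, n_features)
    sampler_cls = getattr(over_sampling, name)
    if name == "SMOTENC":
        return sampler_cls(categorical_features=categorical_features, **kwargs)
    return sampler_cls(**kwargs)
```

With an empty list, `choose_smote_variant([], 3)` selects plain `SMOTE`, which avoids handing SMOTENC an empty set of categorical columns.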

hiyamgh avatar Feb 06 '21 14:02 hiyamgh

Hi @hiyamgh, I am having the same issue. Did you fix the problem? I am very new to the field and can hardly follow #743.

MontaseerAlam avatar Sep 13 '21 08:09 MontaseerAlam

Hi all! I found this thread while searching for a solution to an identical problem. I have found that SMOTE-based algorithms in general might have a problem with oversampling an extremely scarce class. ADASYN solved my problem.
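
One way to apply this workaround is to try several samplers in order and keep the first one that succeeds. The sketch below is an assumption, not part of imbalanced-learn: `resample_with_fallback` is a hypothetical helper that only relies on each sampler exposing `fit_resample(X, y)`, and it treats `ValueError` and `RuntimeError` as "this sampler cannot handle the data".

```python
def resample_with_fallback(samplers, X, y):
    """Try each sampler in order and return the first successful result.

    Hypothetical helper: returns (sampler_used, X_resampled, y_resampled).
    """
    errors = []
    for sampler in samplers:
        try:
            X_res, y_res = sampler.fit_resample(X, y)
            return sampler, X_res, y_res
        except (ValueError, RuntimeError) as exc:
            # Remember why this sampler failed, then move on to the next one.
            errors.append(f"{type(sampler).__name__}: {exc}")
    raise RuntimeError("All samplers failed:\n" + "\n".join(errors))


# Possible usage with imbalanced-learn (not executed here):
# from imblearn.over_sampling import SVMSMOTE, ADASYN
# used, X_res, y_res = resample_with_fallback(
#     [SVMSMOTE(random_state=42, k_neighbors=4), ADASYN(random_state=42)],
#     X, y,
# )
```

The fallback sampler can fail on some datasets too, so the helper reports every failure rather than hiding them.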

szperajacyzolw avatar Dec 01 '21 09:12 szperajacyzolw

Is this fixed? I am having the same issue

nmshafie1993 avatar Feb 02 '22 20:02 nmshafie1993

This is still present with Python 3.9.9 and imbalanced-learn 0.9.0.

4d30 avatar Feb 18 '22 22:02 4d30

Regarding the original example, class_sep=0 really means that all data points are mixed. Therefore, the support vectors are categorized as noise. In this case, there is no other solution than using another variant. In real life, there is actually no point in doing machine learning in this case because the underlying classifier will be useless.
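
The mechanism can be illustrated with a toy, pure-Python version of the noise check. This is a sketch, not imbalanced-learn's code: it assumes a candidate point counts as noise when all of its m nearest neighbours carry a different label, and the helper names and the one-dimensional data are made up for illustration.

```python
def neighbours_of_other_class(i, X, y, m_neighbors):
    """Count how many of the m nearest neighbours of X[i] have another label."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(X[i], X[j])), j)
        for j in range(len(X)) if j != i
    )
    return sum(y[j] != y[i] for _, j in dists[:m_neighbors])


def drop_noise(candidates, X, y, m_neighbors):
    """Keep candidates that have at least one same-class neighbour."""
    return [i for i in candidates
            if neighbours_of_other_class(i, X, y, m_neighbors) < m_neighbors]


X = [(0.0,), (1.0,), (2.0,), (3.0,), (4.0,), (5.0,), (6.0,)]

# Mixed classes: every minority point (label 0) sits among majority points.
y_mixed = [1, 1, 0, 1, 1, 1, 0]
mixed_kept = drop_noise([2, 6], X, y_mixed, m_neighbors=2)

# Separated classes: the minority points are next to each other.
y_separated = [0, 0, 0, 1, 1, 1, 1]
separated_kept = drop_noise([0, 1, 2], X, y_separated, m_neighbors=2)

print(mixed_kept, separated_kept)
```

In the mixed case no candidate survives the filter, which corresponds to the empty array that the later nearest-neighbours call in the traceback rejects.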

glemaitre avatar Jul 10 '23 15:07 glemaitre