imbalanced-learn [BUG]- error with SMOTENC fit_resample: ValueError: could not broadcast input array from shape (137,12) into shape (272,12

Open jox79 opened this issue 3 years ago • 12 comments

Describe the bug

Error with SMOTENC.fit_resample: ValueError: could not broadcast input array from shape (137,12) into shape (272,12)

Steps/Code to Reproduce

Using the two attached CSV datasets, X and y:

X.zip y.zip

I'm running:

from imblearn.over_sampling import SMOTENC

# X and y are loaded from the attached CSV files.
smote = SMOTENC(
    categorical_features=[19],
    sampling_strategy="auto",
    random_state=0,
    n_jobs=8,
)
X, y = smote.fit_resample(X, y)

Expected Results

No error is thrown.

Actual Results

File "C:\Users\c42steguerri\PycharmProjects\StrategyLab\venv\lib\site-packages\imblearn\over_sampling\_smote\base.py", line 577, in _generate_samples
    ] = self._X_categorical_minority_encoded
ValueError: could not broadcast input array from shape (137,12) into shape (272,12)

Versions

System:
    python: 3.7.7 (tags/v3.7.7:d7c567b08f, Mar 10 2020, 10:41:24) [MSC v.1900 64 bit (AMD64)]
executable: C:\Users\c42steguerri\PycharmProjects\StrategyLab\venv\Scripts\python.exe
   machine: Windows-10-10.0.16299-SP0

Python dependencies:
          pip: 19.0.3
   setuptools: 40.8.0
      sklearn: 0.24.1
        numpy: 1.18.4
        scipy: 1.4.1
       Cython: None
       pandas: 1.0.5
   matplotlib: None
       joblib: 0.14.1
threadpoolctl: 2.0.0

Built with OpenMP: True

May 10 '21 09:05 jox79

I'm having a similar issue with some code I'm testing. If I discover anything I'll let you know.

Jun 01 '21 18:06 SkylarTrigueiro

What are your imbalanced-learn versions?

Jun 01 '21 18:06 chkoar

@jox79 please post a code snippet in order to reproduce the error.

Jun 01 '21 18:06 chkoar

I'm having the same problem. I'm using imbalanced-learn version 0.8.0.

Jun 24 '21 14:06 jonasjostmann

I have found a rather unattractive workaround in the meantime: I choose sampling_strategy='minority' and loop over all labels.

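import numpy as np
from imblearn.over_sampling import SMOTENC

smotenc = SMOTENC(
    categorical_features=[250],
    random_state=42,
    k_neighbors=5,
    sampling_strategy="minority",
)

# Each call over-samples only the current minority class,
# so repeat the call once per label.
for _ in np.unique(y):
    X, y = smotenc.fit_resample(X, y)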

Did I miss something?
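
The effect of this loop on class sizes can be simulated without imbalanced-learn. This is only a sketch, assuming that each call with sampling_strategy="minority" grows the currently smallest class to the size of the largest one; the helper name and the class counts are made up for illustration.

```python
from collections import Counter


def resample_minority_once(counts):
    """Simulate one call: grow the smallest class to the size of the largest."""
    updated = Counter(counts)
    smallest = min(updated, key=updated.get)
    updated[smallest] = max(updated.values())
    return updated


# Made-up class sizes for a three-class problem.
counts = Counter({"a": 455, "b": 120, "c": 6})

# One simulated call per label, mirroring the loop over np.unique(y).
for _ in range(len(counts)):
    counts = resample_minority_once(counts)

print(counts)
```

Under that assumption, one pass per label is enough for every class to reach the majority size, and only one class is over-sampled per call.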

Jun 24 '21 16:06 jonasjostmann

I'm still getting this error with v0.8.1 as well:

File "C:\CRIF\StrategyOne\S170\wspace\lab\venv\lib\site-packages\imblearn\base.py", line 83, in fit_resample
    output = self._fit_resample(X, y)
  File "C:\CRIF\StrategyOne\S170\wspace\lab\venv\lib\site-packages\imblearn\over_sampling\_smote\base.py", line 518, in _fit_resample
    X_resampled, y_resampled = super()._fit_resample(X_encoded, y)
  File "C:\CRIF\StrategyOne\S170\wspace\lab\venv\lib\site-packages\imblearn\over_sampling\_smote\base.py", line 311, in _fit_resample
    X_class, y.dtype, class_sample, X_class, nns, n_samples, 1.0
  File "C:\CRIF\StrategyOne\S170\wspace\lab\venv\lib\site-packages\imblearn\over_sampling\_smote\base.py", line 103, in _make_samples
    X_new = self._generate_samples(X, nn_data, nn_num, rows, cols, steps)
  File "C:\CRIF\StrategyOne\S170\wspace\lab\venv\lib\site-packages\imblearn\over_sampling\_smote\base.py", line 577, in _generate_samples
    ] = self._X_categorical_minority_encoded
Exception: could not broadcast input array from shape (6,154) into shape (455,154)

I have no idea how to solve it.

Dec 02 '21 16:12 jox79

The issue here is that, for the case where the median of the standard deviations is 0, the internal algorithm was wrongly designed for binary classification only. This needs to be adapted to multiclass. I assume it boils down to storing _X_categorical_minority_encoded for all the classes to be over-sampled, not only the minority class.
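
The mismatch can be reproduced with plain NumPy. This is a minimal sketch, assuming the failing assignment writes the encoding stored for the minority class into the one-hot columns of another class's rows; the array names and the number of continuous columns are illustrative and not taken from the library.

```python
import numpy as np

n_continuous = 3     # illustrative value, not taken from the issue
n_ohe_columns = 12   # width of the one-hot block, as in the reported shapes

# One-hot block stored for the minority class only (137 rows in the report).
minority_encoded = np.ones((137, n_ohe_columns))

# Rows of a different class that is also being over-sampled (272 rows).
other_class_rows = np.zeros((272, n_continuous + n_ohe_columns))

message = ""
try:
    # Row counts differ (137 vs 272), so NumPy cannot broadcast the assignment.
    other_class_rows[:, n_continuous:] = minority_encoded
except ValueError as exc:
    message = str(exc)

print(message)
```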

Jan 16 '22 18:01 glemaitre

In short:

        # we can replace the 1 entries of the categorical features with the
        # median of the standard deviation. It will ensure that whenever
        # distance is computed between 2 samples, the difference will be equal
        # to the median of the standard deviation as in the original paper.

        # In the edge case where the median of the std is equal to 0, the 1s
        # entries will be also nullified. In this case, we store the original
        # categorical encoding which will be later used for inversing the OHE
        if math.isclose(self.median_std_, 0):
            self._X_categorical_minority_encoded = _safe_indexing(
                X_ohe.toarray(), np.flatnonzero(y == class_minority)
            )

Here, we need to store the encoding not only for the minority class but for all classes to be resampled.
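
One way to picture that direction is to keep one stored block per class. The sketch below is not the code from any merged fix; the helper name and the toy data are hypothetical, and a dense one-hot array is assumed.

```python
import numpy as np


def categorical_encoding_per_class(X_ohe, y, classes_to_resample):
    """Hypothetical helper: keep the one-hot rows of every class to resample."""
    return {
        label: X_ohe[np.flatnonzero(y == label)]
        for label in classes_to_resample
    }


# Toy data: three classes with 6, 4 and 2 samples and 5 one-hot columns.
y = np.array([0] * 6 + [1] * 4 + [2] * 2)
X_ohe = np.eye(5)[np.arange(y.size) % 5]

stored = categorical_encoding_per_class(X_ohe, y, classes_to_resample=[1, 2])

for label, block in stored.items():
    class_rows = np.zeros((int(np.sum(y == label)), 5))
    # The stored block has as many rows as the class, so the assignment fits.
    class_rows[:, :] = block
```

Looking up the stored block by class label keeps the row counts aligned for each class that is over-sampled.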

Jan 16 '22 18:01 glemaitre

Is there any way to have this issue fixed in one of the next releases? It is really important in my opinion. Thanks very much!

Jan 31 '22 17:01 jox79

@jox79 feel free to open a PR to fix the bug

Jan 31 '22 17:01 glemaitre

I put up a fix here @jox79 https://github.com/scikit-learn-contrib/imbalanced-learn/pull/905

Jun 03 '22 14:06 freddyaboulton

Hi everyone, can I check the status of this MR? I am facing the same error. However, it's pretty random: sometimes it is able to run, sometimes it isn't. Please see the error log below. Thanks a lot!

Sep 05 '22 06:09 kelvinheng92

I got the same error. This is the traceback:

ValueError                                Traceback (most recent call last)
/tmp/ipykernel_112/2018849994.py in <module>
      6 Y_validation = np.asarray(LabelEncoder().fit_transform(Y_validation))
      7 print(f"Y_type {type(Y_training)}\tshape Y_train {Y_training.shape}")
----> 8 X_training_rus, Y_training_rus = over_sampler.fit_resample(X_train_concat, Y_training)
      9 print("Sampled!")
     10 

/opt/conda/lib/python3.7/site-packages/imblearn/base.py in fit_resample(self, X, y)
     75         check_classification_targets(y)
     76         arrays_transformer = ArraysTransformer(X, y)
---> 77         X, y, binarize_y = self._check_X_y(X, y)
     78 
     79         self.sampling_strategy_ = check_sampling_strategy(

/opt/conda/lib/python3.7/site-packages/imblearn/over_sampling/_random_over_sampler.py in _check_X_y(self, X, y)
    144             accept_sparse=["csr", "csc"],
    145             dtype=None,
--> 146             force_all_finite=False,
    147         )
    148         return X, y, binarize_y

/opt/conda/lib/python3.7/site-packages/sklearn/base.py in _validate_data(self, X, y, reset, validate_separately, **check_params)
    430                 y = check_array(y, **check_y_params)
    431             else:
--> 432                 X, y = check_X_y(X, y, **check_params)
    433             out = X, y
    434 

/opt/conda/lib/python3.7/site-packages/sklearn/utils/validation.py in inner_f(*args, **kwargs)
     70                           FutureWarning)
     71         kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 72         return f(**kwargs)
     73     return inner_f
     74 

/opt/conda/lib/python3.7/site-packages/sklearn/utils/validation.py in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, estimator)
    800                     ensure_min_samples=ensure_min_samples,
    801                     ensure_min_features=ensure_min_features,
--> 802                     estimator=estimator)
    803     if multi_output:
    804         y = check_array(y, accept_sparse='csr', force_all_finite=True,

/opt/conda/lib/python3.7/site-packages/sklearn/utils/validation.py in inner_f(*args, **kwargs)
     70                           FutureWarning)
     71         kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 72         return f(**kwargs)
     73     return inner_f
     74 

/opt/conda/lib/python3.7/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
    596                     array = array.astype(dtype, casting="unsafe", copy=False)
    597                 else:
--> 598                     array = np.asarray(array, order=order, dtype=dtype)
    599             except ComplexWarning:
    600                 raise ValueError("Complex data not supported\n"

/opt/conda/lib/python3.7/site-packages/numpy/core/_asarray.py in asarray(a, dtype, order)
     81 
     82     """
---> 83     return array(a, dtype, copy=False, order=order)
     84 
     85

It looks like the problem arises when it internally calls /opt/conda/lib/python3.7/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)

There may be some parameter that needs to be reset: the error is thrown by numpy when it calls array = np.asarray(array, order=order, dtype=dtype)

I checked my input by calling the same np.asarray() function:

print(f"Y_type {type(Y_training)}\tshape Y_train {np.asarray(Y_training).shape}")

and it is:

Y_type <class 'numpy.ndarray'>	shape Y_train (56123,)

I was thinking that maybe the force_all_finite or the ensure_2d arguments are the issue, also because we can read the lines:

/opt/conda/lib/python3.7/site-packages/imblearn/over_sampling/_random_over_sampler.py in _check_X_y(self, X, y)
    144             accept_sparse=["csr", "csc"],
    145             dtype=None,
--> 146             force_all_finite=False,
    147         )
    148         return X, y, binarize_y

from the traceback.

I don't know whether this makes sense or could be helpful, though. I desperately need a fix for this, haha.

Dec 01 '22 16:12 lolloconsoli

It should be solved in https://github.com/scikit-learn-contrib/imbalanced-learn/pull/1015

Jul 10 '23 15:07 glemaitre

Hi @glemaitre, just wondering when this change is going to be released. I think it didn't make it into 0.11.0, right? It seems like #1015 was merged a couple of days after the last release?

Sep 07 '23 08:09 LukebethamStonehaven

It should already be available in the latest release, 0.11.

Sep 07 '23 08:09 glemaitre

Oh right, I have updated to 0.11 and am still getting this error. It only seems to happen sometimes, though...

Sep 07 '23 08:09 LukebethamStonehaven

It could be another bug with the same error. Don't hesitate to open a new issue with a minimal example that triggers the error.

Sep 07 '23 08:09 glemaitre
