imbalanced-learn
[BUG] Error with SMOTENC fit_resample: ValueError: could not broadcast input array from shape (137,12) into shape (272,12)
Describe the bug
Error with SMOTENC.fit_resample: ValueError: could not broadcast input array from shape (137,12) into shape (272,12)
Steps/Code to Reproduce
Using the two X and y CSV datasets attached, I'm running:

from imblearn.over_sampling import SMOTENC

smote = SMOTENC(
    categorical_features=[19],
    sampling_strategy="auto",
    random_state=0,
    n_jobs=8,
)
X, y = smote.fit_resample(X, y)
Expected Results
No error is thrown.
Actual Results
File "C:\Users\c42steguerri\PycharmProjects\StrategyLab\venv\lib\site-packages\imblearn\over_sampling\_smote\base.py", line 577, in _generate_samples
] = self._X_categorical_minority_encoded
ValueError: could not broadcast input array from shape (137,12) into shape (272,12)
Versions
System:
python: 3.7.7 (tags/v3.7.7:d7c567b08f, Mar 10 2020, 10:41:24) [MSC v.1900 64 bit (AMD64)]
executable: C:\Users\c42steguerri\PycharmProjects\StrategyLab\venv\Scripts\python.exe
machine: Windows-10-10.0.16299-SP0
Python dependencies:
pip: 19.0.3
setuptools: 40.8.0
sklearn: 0.24.1
numpy: 1.18.4
scipy: 1.4.1
Cython: None
pandas: 1.0.5
matplotlib: None
joblib: 0.14.1
threadpoolctl: 2.0.0
Built with OpenMP: True
I'm having a similar issue with some code I'm testing. If I discover anything I'll let you know. What is your imbalanced-learn version?
@jox79 please post a code snippet in order to reproduce the error.
I'm having the same problem. I'm using imbalanced-learn version 0.8.0.
I have found a rather unattractive workaround for the meantime: I choose sampling_strategy='minority' and loop over all labels.
smotenc = SMOTENC(
    categorical_features=[250],
    random_state=42,
    k_neighbors=5,
    sampling_strategy="minority",
)
for label in np.unique(y):
    X, y = smotenc.fit_resample(X, y)
Did I miss something?
I'm still getting this error with v0.8.1:
File "C:\CRIF\StrategyOne\S170\wspace\lab\venv\lib\site-packages\imblearn\base.py", line 83, in fit_resample
output = self._fit_resample(X, y)
File "C:\CRIF\StrategyOne\S170\wspace\lab\venv\lib\site-packages\imblearn\over_sampling\_smote\base.py", line 518, in _fit_resample
X_resampled, y_resampled = super()._fit_resample(X_encoded, y)
File "C:\CRIF\StrategyOne\S170\wspace\lab\venv\lib\site-packages\imblearn\over_sampling\_smote\base.py", line 311, in _fit_resample
X_class, y.dtype, class_sample, X_class, nns, n_samples, 1.0
File "C:\CRIF\StrategyOne\S170\wspace\lab\venv\lib\site-packages\imblearn\over_sampling\_smote\base.py", line 103, in _make_samples
X_new = self._generate_samples(X, nn_data, nn_num, rows, cols, steps)
File "C:\CRIF\StrategyOne\S170\wspace\lab\venv\lib\site-packages\imblearn\over_sampling\_smote\base.py", line 577, in _generate_samples
] = self._X_categorical_minority_encoded
Exception: could not broadcast input array from shape (6,154) into shape (455,154)
I have no idea how to solve it.
The issue here is that the internal algorithm was designed only for binary classification in the case where the median of the std. dev. == 0. This needs to be adapted to multiclass. I assume it boils down to storing _X_categorical_minority_encoded for all the classes to be over-sampled, not only the minority class.
In short:
# we can replace the 1 entries of the categorical features with the
# median of the standard deviation. It will ensure that whenever
# distance is computed between 2 samples, the difference will be equal
# to the median of the standard deviation as in the original paper.
# In the edge case where the median of the std is equal to 0, the 1s
# entries will be also nullified. In this case, we store the original
# categorical encoding which will be later used for inversing the OHE
if math.isclose(self.median_std_, 0):
    self._X_categorical_minority_encoded = _safe_indexing(
        X_ohe.toarray(), np.flatnonzero(y == class_minority)
    )
Here, we need to store the encoding not only for the minority class but for every class to be resampled.
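A hedged sketch of that idea, not the actual patch from the linked PRs: keep the encoded categorical rows per class instead of only for the global minority class, so a multiclass resampling loop can look up the right block whenever the median std is 0. store_encoded_categories and all names are hypothetical:

```python
import math
import numpy as np

def store_encoded_categories(X_ohe, y, classes_to_resample, median_std):
    # Instead of a single _X_categorical_minority_encoded array, keep one
    # encoded block per class that will be over-sampled, keyed by label.
    encoded = {}
    if math.isclose(median_std, 0):
        for klass in classes_to_resample:
            encoded[klass] = X_ohe[np.flatnonzero(y == klass)]
    return encoded

y = np.array([0, 0, 1, 1, 2])
X_ohe = np.eye(5)  # stand-in for the one-hot-encoded features
blocks = store_encoded_categories(X_ohe, y, classes_to_resample=[1, 2], median_std=0.0)
```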
Is there no way to have that issue fixed in one of the next releases? It is really important in my opinion. Thanks very much!
@jox79 feel free to open a PR to fix the bug
I put up a fix here @jox79 https://github.com/scikit-learn-contrib/imbalanced-learn/pull/905
Hi everyone, can I check the status of this PR? I am facing the same error. However, it's pretty random: sometimes it is able to run, sometimes it isn't. Please see the error log below. Thanks a lot!
I got the same error; this is the traceback:
ValueError Traceback (most recent call last)
/tmp/ipykernel_112/2018849994.py in <module>
6 Y_validation = np.asarray(LabelEncoder().fit_transform(Y_validation))
7 print(f"Y_type {type(Y_training)}\tshape Y_train {Y_training.shape}")
----> 8 X_training_rus, Y_training_rus = over_sampler.fit_resample(X_train_concat, Y_training)
9 print("Sampled!")
10
/opt/conda/lib/python3.7/site-packages/imblearn/base.py in fit_resample(self, X, y)
75 check_classification_targets(y)
76 arrays_transformer = ArraysTransformer(X, y)
---> 77 X, y, binarize_y = self._check_X_y(X, y)
78
79 self.sampling_strategy_ = check_sampling_strategy(
/opt/conda/lib/python3.7/site-packages/imblearn/over_sampling/_random_over_sampler.py in _check_X_y(self, X, y)
144 accept_sparse=["csr", "csc"],
145 dtype=None,
--> 146 force_all_finite=False,
147 )
148 return X, y, binarize_y
/opt/conda/lib/python3.7/site-packages/sklearn/base.py in _validate_data(self, X, y, reset, validate_separately, **check_params)
430 y = check_array(y, **check_y_params)
431 else:
--> 432 X, y = check_X_y(X, y, **check_params)
433 out = X, y
434
/opt/conda/lib/python3.7/site-packages/sklearn/utils/validation.py in inner_f(*args, **kwargs)
70 FutureWarning)
71 kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 72 return f(**kwargs)
73 return inner_f
74
/opt/conda/lib/python3.7/site-packages/sklearn/utils/validation.py in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, estimator)
800 ensure_min_samples=ensure_min_samples,
801 ensure_min_features=ensure_min_features,
--> 802 estimator=estimator)
803 if multi_output:
804 y = check_array(y, accept_sparse='csr', force_all_finite=True,
/opt/conda/lib/python3.7/site-packages/sklearn/utils/validation.py in inner_f(*args, **kwargs)
70 FutureWarning)
71 kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 72 return f(**kwargs)
73 return inner_f
74
/opt/conda/lib/python3.7/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
596 array = array.astype(dtype, casting="unsafe", copy=False)
597 else:
--> 598 array = np.asarray(array, order=order, dtype=dtype)
599 except ComplexWarning:
600 raise ValueError("Complex data not supported\n"
/opt/conda/lib/python3.7/site-packages/numpy/core/_asarray.py in asarray(a, dtype, order)
81
82 """
---> 83 return array(a, dtype, copy=False, order=order)
84
85
It looks like the failure happens when it internally calls check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator) in /opt/conda/lib/python3.7/site-packages/sklearn/utils/validation.py; there may be some parameter that needs to be changed. The error is thrown by numpy when it calls array = np.asarray(array, order=order, dtype=dtype). I checked my input by calling the same np.asarray() function:
print(f"Y_type {type(Y_training)}\tshape Y_train {np.asarray(Y_training).shape}")
and it is:
Y_type <class 'numpy.ndarray'> shape Y_train (56123,)
I was thinking maybe the force_all_finite or the ensure_2d arguments are the issue, also because we can read the lines:
/opt/conda/lib/python3.7/site-packages/imblearn/over_sampling/_random_over_sampler.py in _check_X_y(self, X, y)
144 accept_sparse=["csr", "csc"],
145 dtype=None,
--> 146 force_all_finite=False,
147 )
148 return X, y, binarize_y
from the traceback.
I don't know whether this makes sense or could be helpful; I desperately need a fix for this, hahaha.
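Since the traceback dies inside check_array's np.asarray call, a quick sanity check that rules out ragged or misaligned input before fit_resample may narrow things down. check_inputs is a hypothetical helper, not part of either library, and the real cause may lie elsewhere:

```python
import numpy as np

def check_inputs(X, y):
    # On older NumPy, ragged rows collapse to an object-dtype 1-D array
    # under np.asarray (newer NumPy raises directly); either way,
    # sklearn's check_array later fails, so catch the problem early.
    X = np.asarray(X)
    y = np.asarray(y)
    if X.ndim != 2:
        raise ValueError(f"X must be 2-D, got ndim={X.ndim}; rows may be ragged")
    if y.ndim != 1 or len(y) != len(X):
        raise ValueError("y must be 1-D and aligned with X")
    return X, y

X_ok, y_ok = check_inputs([[1.0, 2.0], [3.0, 4.0]], [0, 1])
```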
It should be solved in https://github.com/scikit-learn-contrib/imbalanced-learn/pull/1015
Hi @glemaitre, just wondering when this change is going to be released. I think it didn't make it in to 0.11.0 right? Seems like #1015 was merged a couple days after the last release?
It should already be available in the latest release, 0.11.
Oh right, I have updated to 0.11 and am still getting this error, though it only seems to happen sometimes...
It could be another bug with the same error. Don't hesitate to open a new issue with a minimal example that triggers the error.