Why do I get this error when I use smote_variants?

Open ppleumyy opened this issue 2 years ago • 9 comments

This is my code:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
import smote_variants as sv

# tokenize is defined elsewhere (not shown)
vectorCount = CountVectorizer(tokenizer=tokenize)
X_trainCount = vectorCount.fit_transform(X_train)

tf_transformer = TfidfTransformer(use_idf=False)
tf_transformer.fit(X_trainCount)
X_trainTF = tf_transformer.transform(X_trainCount)

oversampler = sv.MulticlassOversampling(sv.distance_SMOTE())
X_res, y_res = oversampler.sample(X_trainTF, y_train)

and I get this error:

ValueError: provided out is the wrong size for the reduction

ppleumyy avatar Jun 24 '22 09:06 ppleumyy

Could you share the dimensions of X_trainTF and y_train?

gykovacs avatar Jun 24 '22 10:06 gykovacs

X_trainTF: (4621, 2134), y_train: (4621,)

@gykovacs

ppleumyy avatar Jun 24 '22 10:06 ppleumyy

Interesting, which versions of Python and numpy are you using? There might have been some changes in the latest versions which have not been checked yet. (The tests were executed up to Python 3.9; I should cover the most recent versions soon.)

gykovacs avatar Jun 24 '22 10:06 gykovacs

Python version is 3.7.13, numpy version is 1.21.6.

@gykovacs

ppleumyy avatar Jun 24 '22 10:06 ppleumyy

Cool, this is not the case then, it should work with this setup. If it is not much of a burden, could you please prepare a minimal working example, like replacing the X_trainTF and y_train with some random arrays of the same size, feed them into the MulticlassOversampling and see if it fails? I could use that as a minimal working example for debugging.

Also, could you please share the label distribution in y_train? Are the labels of integer type?
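A minimal sketch of what such an example could look like, assuming dense random features with the reported shapes and integer labels. The helper names and the choice of six classes are assumptions made for illustration; the oversampler call simply mirrors the one from the original post and is not checked against any particular smote_variants version.

```python
import numpy as np


def make_dummy_data(n_samples=4621, n_features=2134, n_classes=6, seed=0):
    """Random dense features and integer labels with the reported shapes."""
    rng = np.random.default_rng(seed)
    X = rng.random((n_samples, n_features))
    # n_classes=6 is an assumption, not something stated at this point in the thread
    y = rng.integers(0, n_classes, size=n_samples)
    return X, y


def label_distribution(y):
    """Map each distinct label to the number of times it occurs."""
    labels, counts = np.unique(y, return_counts=True)
    return dict(zip(labels.tolist(), counts.tolist()))


def run_oversampler(X, y):
    """Same call as in the original post, applied to the given arrays."""
    # Imported lazily so the helpers above work without smote_variants installed
    import smote_variants as sv
    oversampler = sv.MulticlassOversampling(sv.distance_SMOTE())
    return oversampler.sample(X, y)


if __name__ == "__main__":
    X, y = make_dummy_data()
    print(X.shape, y.shape, y.dtype)
    print(label_distribution(y))
    # X_res, y_res = run_oversampler(X, y)  # uncomment to exercise the oversampler
```

If the oversampler succeeds on these arrays but fails on the real data, the difference most likely lies in the type or layout of the real inputs rather than in their size.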

gykovacs avatar Jun 24 '22 10:06 gykovacs

This is my Google Colab workspace: https://colab.research.google.com/drive/1ETmdFjWEJdayBq_Ji3Eu6qKprrc0lC_G?usp=sharing

and the dataset file: Suicidal_K1_Train.csv

@gykovacs

ppleumyy avatar Jun 24 '22 10:06 ppleumyy

Perfect, I'll look into it!

gykovacs avatar Jun 24 '22 10:06 gykovacs

Thank you very much!

@gykovacs

ppleumyy avatar Jun 24 '22 10:06 ppleumyy

Hi @ppleumyy, all the smote_variants tools operate on numerical arrays. Your y_train is a pandas Series containing strings, and your X_trainTF is a sparse matrix (it needs to be dense). With the following changes, everything seems to work as expected:

y_train[y_train == 'Level 1'] = 1
y_train[y_train == 'Level 2'] = 2
y_train[y_train == 'Level 3'] = 3
y_train[y_train == 'Level 4'] = 4
y_train[y_train == 'Level 5'] = 5
y_train[y_train == 'Other'] = 0

y_train = y_train.values

X_trainTF = X_trainTF.todense()
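The same conversion can also be sketched as one reusable function. This is only an illustration, not part of smote_variants: to_numeric_inputs and LABEL_MAP are hypothetical names, the label codes are the ones used above, and toarray() is swapped in for todense() because it returns a plain numpy ndarray where todense() returns a numpy matrix.

```python
import numpy as np
from scipy import sparse

# Label names and integer codes taken from the assignments above
LABEL_MAP = {
    "Other": 0,
    "Level 1": 1,
    "Level 2": 2,
    "Level 3": 3,
    "Level 4": 4,
    "Level 5": 5,
}


def to_numeric_inputs(X, y, label_map=LABEL_MAP):
    """Hypothetical helper: return a dense ndarray and an integer label array."""
    # scipy sparse matrices expose toarray(), which yields a plain ndarray
    X_dense = X.toarray() if sparse.issparse(X) else np.asarray(X)
    # Iterating works for lists, numpy arrays and pandas Series alike;
    # an unknown label raises KeyError instead of passing through silently
    y_int = np.array([label_map[label] for label in y], dtype=int)
    return X_dense, y_int


# Tiny stand-ins for X_trainTF and y_train
X_demo = sparse.csr_matrix(np.array([[0.0, 0.5], [1.0, 0.0], [0.0, 0.0]]))
y_demo = ["Level 1", "Other", "Level 5"]
X_dense, y_int = to_numeric_inputs(X_demo, y_demo)
print(type(X_dense).__name__, X_dense.shape, y_int.dtype.kind, y_int)
```

Building a new integer array, rather than assigning into the Series in place, keeps the original y_train untouched and guarantees an integer dtype for the labels.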

gykovacs avatar Jul 05 '22 17:07 gykovacs