thundersvm
Predicted class only has one label
Hello there, I am trying to use thundersvm on a text classification problem. I can run the test and get 0.98 accuracy, so the library seems to work on the test data. The problem is that when I use it on a text classification problem (e.g. the 20 newsgroups dataset), I get very strange predictions and therefore low accuracy compared to the sklearn SVC class. (In fact, y_pred is all "0"!) To demonstrate the problem, I have written a simple function to test it:
import time
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, HashingVectorizer, TfidfTransformer
import numpy as np
from sklearn import svm
import thundersvm
from sklearn.datasets import fetch_20newsgroups

def compare_sklearn_thunder(clf):
    categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']
    twenty_train = fetch_20newsgroups(data_home='.', subset='train', categories=categories, shuffle=True, random_state=42)
    twenty_test = fetch_20newsgroups(data_home='.', subset='test', categories=categories, shuffle=True, random_state=42)
    count_vect = CountVectorizer()
    tfidf_transformer = TfidfTransformer()
    X_counts = count_vect.fit_transform(twenty_train.data + twenty_test.data)
    X_tfidf = tfidf_transformer.fit(X_counts)
    X_train = X_tfidf.transform(X_counts[:len(twenty_train.data)])
    X_test = X_tfidf.transform(X_counts[len(twenty_train.data):])
    s_time = time.time()
    print('X_train.shape = {}'.format(X_train.shape))
    print('X_test.shape = {}'.format(X_test.shape))
    clf.fit(X_train, twenty_train.target)
    train_time = time.time() - s_time
    print('Training Time = {:.4} seconds'.format(train_time))
    y_pred = clf.predict(X_test).astype(int)
    print(y_pred[:10])
    print('Accuracy = {:.2}'.format(np.mean(y_pred == twenty_test.target)))
Now if I test thundersvm.SVC():
compare_sklearn_thunder(thundersvm.SVC())
I got:
X_train.shape = (2257, 47319)
X_test.shape = (1502, 47319)
Training Time = 1.954 seconds
sample y_pred : [0 0 0 0 0 0 0 0 0 0]
sample y_true : [2 2 2 0 3 0 1 3 2 2]
Accuracy = 0.21
but when I test sklearn SVC:
compare_sklearn_thunder(svm.SVC())
It works fine:
X_train.shape = (2257, 47319)
X_test.shape = (1502, 47319)
Training Time = 8.247 seconds
sample y_pred : [2 2 2 0 3 0 1 3 1 1]
sample y_true : [2 2 2 0 3 0 1 3 2 2]
Accuracy = 0.88
How can I solve this problem? Thanks in advance.
Hi @Salehoof,
You can tune the parameters of SVC to get a good model. For example, if you try svm.SVC(gamma=0.5, C=100), the accuracy of ThunderSVM is 0.9.
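A rough sketch of why gamma can matter so much on this data, offered as a possible explanation rather than a confirmed diagnosis. It assumes an RBF kernel, assumes the default gamma is 1/n_features (an assumption about the library default, not something stated in this thread), and relies on TfidfTransformer L2-normalising each row by default, so squared distances between documents fall in [0, 2]. The helper name rbf_kernel_value is made up for illustration.

```python
import math

def rbf_kernel_value(sq_dist, gamma):
    """RBF kernel K(x, y) = exp(-gamma * ||x - y||^2) for a given squared distance."""
    return math.exp(-gamma * sq_dist)

n_features = 47319               # taken from X_train.shape in the question
gamma_small = 1.0 / n_features   # ASSUMPTION: a 1/n_features default gamma
gamma_tuned = 0.5                # value suggested in the reply above

# For non-negative, L2-normalised tf-idf rows, ||x - y||^2 = 2 - 2*x.y lies in [0, 2].
max_sq_dist = 2.0

# Kernel value between the two most dissimilar documents possible.
k_small = rbf_kernel_value(max_sq_dist, gamma_small)
k_tuned = rbf_kernel_value(max_sq_dist, gamma_tuned)

print('gamma = {:.2e}: kernel range [{:.6f}, 1.0]'.format(gamma_small, k_small))
print('gamma = {:.2e}: kernel range [{:.6f}, 1.0]'.format(gamma_tuned, k_tuned))
```

Under these assumptions, the tiny gamma squeezes every kernel value into a very narrow band just below 1, so all documents look almost identical to the classifier, while gamma=0.5 spreads the values over a much wider range. That would be consistent with a model that predicts a single class until gamma (and C) are tuned.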