thundersvm Extremely Poor Performance

I am using the latest master branch of thundersvm. Compared to serial sklearn (LibSVM), thundersvm is orders of magnitude slower. I am probably doing something wrong though.

I have tested this on a GTX 750 Ti and 1060 Ti with the same results. I have to stop thundersvm because it just seems like it will never end, while the serial sklearn takes 0.5 seconds on 50 000 instances of data (5 features).

Here is my test code if you would like to try to replicate this (the dataset is BNG_COMET: https://www.openml.org/d/5648):

ThunderSVM test:

from thundersvm import SVC
import numpy as np
import time

data = np.loadtxt(open("atm/demos/BNG_COMET.csv", "rb"), delimiter=",", skiprows=1)
# data = np.loadtxt(open("atm/demos/pollution.csv", "rb"), delimiter=",", skiprows=1)
# print(data, data.shape)

X = data[:5000, :-1]
y = data[:5000, -1]

xp_lots_of_test_samples = data[5100:5103, :-1]

print("X", X, X.shape)
print(y)

start = time.time()

clf = SVC(C=176.6677880062673, cache_size=150, class_weight='balanced', coef0=0.0,
    decision_function_shape='ovo', degree=3, gamma=12005.61948153516, gpu_id=1,
    kernel='linear', max_iter=5, max_mem_size=-1, n_jobs=-1, probability=True,
    random_state=None, shrinking=True, tol=0.001, verbose=True)

clf.fit(X, y)

end_time = time.time()
totaltime = end_time - start

print('time: ', totaltime)
print("predictions:")
print(clf.predict(xp_lots_of_test_samples))
print("true labels:")
print(data[5100:5103, -1])

Sklearn test:

from sklearn import svm
import numpy as np
import time

# data = np.loadtxt(open("atm/demos/pollution.csv", "rb"), delimiter=",", skiprows=1)
data = np.loadtxt(open("atm/demos/BNG_COMET.csv", "rb"), delimiter=",", skiprows=1)

X = data[:50000, :-1]
y = data[:50000, -1]

xp_lots_of_test_samples = data[50100:50103, :-1]

# clf = svm.SVC(kernel='rbf',
#          verbose=True,
#          gamma=0.5,
#          C=120.51564536384429,
#          max_iter = 50000,
#          class_weight = 'balanced'
#                 )

start = time.time()
clf = svm.SVC(C=176.6677880062673, cache_size=150, class_weight='balanced', coef0=0.0,
    decision_function_shape='ovo', degree=3, gamma=12005.61948153516,
    kernel='linear', max_iter=5, probability=True,
    random_state=None, shrinking=True, tol=0.001, verbose=True)

clf.fit(X, y)

end_time = time.time()
totaltime = end_time - start

print('time: ', totaltime)
print("predictions:")
print(clf.predict(xp_lots_of_test_samples))
print("true labels:")
# print(data[137:150,-1])
print(data[50100:50103, -1])
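
One thing that may help when comparing the two libraries: the scripts above slice 5,000 rows for ThunderSVM and 50,000 rows for sklearn, so the two timings are not measured on the same input. A minimal sketch of a timing helper that fits any estimator on identical data is shown here. `time_fit` and `DummyEstimator` are hypothetical names introduced only for illustration; the stand-in class just lets the helper run without either library installed.

```python
import time


def time_fit(make_estimator, X, y, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` fresh fits."""
    best = float("inf")
    for _ in range(repeats):
        estimator = make_estimator()       # new, unfitted estimator every run
        start = time.perf_counter()
        estimator.fit(X, y)
        best = min(best, time.perf_counter() - start)
    return best


class DummyEstimator:
    """Hypothetical stand-in so the helper can be tried without an SVM library."""

    def fit(self, X, y):
        self.n_samples_ = len(X)
        return self


if __name__ == "__main__":
    X = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]
    y = [0, 1, 0]
    print("best fit time:", time_fit(DummyEstimator, X, y))
```

With the real libraries, the same call would take a factory such as `lambda: SVC(kernel='linear')` for each implementation, with the identical `X` and `y` passed to both.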

Oct 13 '19 13:10 beevabeeva

Update: This might be due to really bad hyperparameters being passed to ThunderSVM from the AutoML framework. A comment in the ATM source code suggests this:

Notes:

# - Support vector machines (svm) can take a long time to train. It's not an
#   error, it's just part of what happens when the method happens to explore
#   a crappy set of parameters on a powerful algo like this.

Having said that, this might not be the only issue causing the slow computation.
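
To make the hyperparameter concern concrete, here is a rough sketch of a check that flags settings likely to distort an SVM timing comparison. The function name `flag_suspicious_svm_params` and the thresholds (100 iterations, C above 100) are assumptions chosen for illustration, not values taken from ATM, sklearn, or ThunderSVM. The remarks about `gamma` and `probability` follow the scikit-learn `SVC` documentation; whether ThunderSVM behaves identically is not confirmed here.

```python
def flag_suspicious_svm_params(params):
    """Return warnings for SVC settings that often distort fit-time comparisons.

    Thresholds are illustrative assumptions, not library defaults.
    """
    warnings = []
    kernel = params.get("kernel", "rbf")
    max_iter = params.get("max_iter", -1)   # -1 means "no limit" in sklearn

    if kernel == "linear" and "gamma" in params:
        warnings.append("gamma is set but the linear kernel does not use it")
    if 0 < max_iter < 100:
        warnings.append(
            f"max_iter={max_iter} stops the solver long before convergence"
        )
    if params.get("probability", False):
        warnings.append(
            "probability=True adds internal cross-validation to fit time in sklearn"
        )
    if params.get("C", 1.0) > 100:
        warnings.append("a large C can slow convergence considerably")
    return warnings


if __name__ == "__main__":
    # The settings from the test scripts in the original post.
    posted = {
        "C": 176.6677880062673,
        "gamma": 12005.61948153516,
        "kernel": "linear",
        "max_iter": 5,
        "probability": True,
    }
    for message in flag_suspicious_svm_params(posted):
        print("-", message)
```

Run against the posted settings, all four checks fire, which suggests the benchmark is measuring an unusual configuration rather than typical training.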

Oct 16 '19 19:10 beevabeeva

Thanks. We will look into the issue.

Some quick hints: hyper-parameters can affect convergence, and so can data normalization. If you can test with different settings and with normalized data, it would help us narrow down the cause.

Oct 17 '19 03:10 zeyiwen

Any update on this? I would also be curious to learn if there are performance bottlenecks...

Feb 04 '20 10:02 emmenlau

ThunderSVM almost always performs much better than existing SVM libraries. The known cause of poor performance in ThunderSVM is slow convergence in some extreme cases (e.g., when the values of each dimension vary from 0 to 10,000). Some extreme hyper-parameters can also hurt the efficiency of SVMs in general, not only ThunderSVM.
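
For the wide-range case mentioned above, a common remedy is to standardize each feature before training. The following is a minimal sketch in plain Python, assuming the data is a list of rows; `standardize_columns` is a hypothetical helper name. In practice scikit-learn's `StandardScaler` performs the same transformation (it also divides by the population standard deviation).

```python
from statistics import fmean, pstdev


def standardize_columns(rows):
    """Scale every feature column to zero mean and unit variance (z-scores)."""
    columns = list(zip(*rows))
    means = [fmean(col) for col in columns]
    # A constant column has zero deviation; fall back to 1.0 to avoid dividing by 0.
    stds = [pstdev(col) or 1.0 for col in columns]
    return [
        [(value - m) / s for value, m, s in zip(row, means, stds)]
        for row in rows
    ]


if __name__ == "__main__":
    # First feature spans 0 to 10,000, second spans 1 to 3.
    rows = [[0.0, 1.0], [5000.0, 2.0], [10000.0, 3.0]]
    for scaled_row in standardize_columns(rows):
        print(scaled_row)
```

After scaling, both features occupy the same range, so neither dominates the kernel computation purely because of its units.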

Feb 05 '20 08:02 zeyiwen

I also have the same problem, using the Kaggle Tabular Playground dataset from February 2021.

Data preparation:

import pandas as pd
from sklearn import svm
from sklearn.compose import ColumnTransformer, TransformedTargetRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from thundersvm import SVR  # assumption: SVR below is ThunderSVM's, given the thread topic

data_train = pd.read_csv('train.csv', sep=",").drop(columns=['id'])
data_test = pd.read_csv('test.csv', sep=",").drop(columns=['id'])
y = data_train['target']
X = data_train.drop(columns=['target'])

# Split the columns into categorical and continuous features by name.
cats_name = [c for c in X.columns if 'cat' in c]
cont_name = [c for c in X.columns if 'cont' in c]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

With svm.LinearSVR, this executed in 52.6s:

column_trans = ColumnTransformer(
    [('cats', OneHotEncoder(), cats_name),
     ('conts', StandardScaler(), cont_name)],
    remainder='drop')

regr = TransformedTargetRegressor(
    regressor=svm.LinearSVR(epsilon=0.0, tol=0.0001, C=1.0,
                            loss='squared_epsilon_insensitive', fit_intercept=True,
                            intercept_scaling=1.0, dual=True, verbose=0,
                            random_state=None, max_iter=2000),
    transformer=StandardScaler())

model = make_pipeline(column_trans, regr)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(mean_squared_error(y_test, y_pred))

With SVR(kernel='linear'), this executed in 7m 4s:

column_trans = ColumnTransformer(
    [('cats', OneHotEncoder(), cats_name),
     ('conts', StandardScaler(), cont_name)],
    remainder='drop')

regr = TransformedTargetRegressor(
    regressor=SVR(kernel='linear', epsilon=0.0, tol=0.0001, C=1.0,
                  verbose=0, max_iter=2000),
    transformer=StandardScaler())

model = make_pipeline(column_trans, regr)
model.fit(X_train, y_train)

Feb 19 '21 08:02 TZDZ