
NU_SVC training is affected by order of training samples

Open alexisrozhkov opened this issue 7 years ago • 3 comments

I have originally discovered this effect using C version of libsvm, wrapped into C++ code, using real data.

Then I created a minimal snippet to reproduce it using libsvm in sklearn with random data for training and testing.

So far I have checked only NU_SVC and C_SVC, and it seems that C_SVC is invariant to the order of the training samples while NU_SVC is not. This is counterintuitive (at least to me), since there is no notion of sample ordering in the SVM problem formulation.
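For reference, both dual problems are symmetric under any permutation of the training indices (in libsvm's notation, with Q_{ij} = y_i y_j K(x_i, x_j) and l training samples):

C-SVC:  \min_{\alpha} \tfrac{1}{2}\alpha^{T} Q \alpha - e^{T}\alpha \quad \text{s.t.} \quad y^{T}\alpha = 0,\; 0 \le \alpha_i \le C

NU-SVC: \min_{\alpha} \tfrac{1}{2}\alpha^{T} Q \alpha \quad \text{s.t.} \quad y^{T}\alpha = 0,\; 0 \le \alpha_i \le 1/l,\; e^{T}\alpha \ge \nu

So any order dependence has to come from the solver, not from the optimization problem itself.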

Here's a snippet to reproduce:

import numpy as np
from sklearn import svm

# training/testing set params
n = 1000  # num of samples
d = 2  # num of dimensions

# random training set (balanced)
train_feats = np.random.uniform(size=(n, d))
train_labels = np.array([i < n/2 for i in range(n)]).reshape(n, 1)

# combine to retain connection between features and labels during shuffling
train = np.hstack([train_feats, train_labels])

# 2 svms to compare
csvm = svm.SVC(kernel='rbf',
               gamma=0.5, 
               degree=3,
               cache_size=100,
               tol=0.001, 
               C=1.0, 
               shrinking=False)

nsvm = svm.NuSVC(kernel='rbf',
                 gamma=0.5, 
                 degree=3,
                 cache_size=100,
                 tol=0.001, 
                 nu=0.1, 
                 shrinking=False)

# random testing set (balanced)
test_feats = np.random.uniform(size=(n, d))
test_labels = np.array([i < n/2 for i in range(n)])

for i in range(10):
    # shuffle train set
    train_shuffled = train.copy()
    np.random.shuffle(train_shuffled)
    
    # split into features and labels
    shuffled_train_feats = train_shuffled[:, :-1]
    shuffled_train_labels = train_shuffled[:, -1]
    
    # train both svms individually
    csvm.fit(shuffled_train_feats, shuffled_train_labels)  
    nsvm.fit(shuffled_train_feats, shuffled_train_labels)  
    
    # make predictions individually
    pred_c = csvm.predict(test_feats)
    pred_n = nsvm.predict(test_feats)

    acc_c = np.mean(pred_c == test_labels)
    acc_n = np.mean(pred_n == test_labels)
    
    print(acc_c, acc_n)

And corresponding output:

0.502 0.49
0.502 0.495
0.502 0.504
0.502 0.5
0.502 0.526
0.502 0.496
0.502 0.505
0.502 0.507
0.502 0.533
0.502 0.516

The first column shows accuracies for C_SVC, the second for NU_SVC, across different shufflings of the training samples. As you can see, C_SVC is invariant to sample order, while NU_SVC is not.

I would rather use NU_SVC, since it is simpler to perform a grid search on, but this effect makes it almost pointless. Am I missing something?
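To illustrate what I mean by simpler grid search: nu is confined to (0, 1] and lower-bounds the fraction of support vectors, whereas C has to be searched on an open-ended log scale. A minimal sketch using sklearn's GridSearchCV; the grids below are illustrative, not tuned, and it reuses train_feats/train_labels from the snippet above:

import numpy as np
from sklearn import svm
from sklearn.model_selection import GridSearchCV

# nu lives in (0, 1], so a uniform grid covers the whole usable range
nu_search = GridSearchCV(svm.NuSVC(kernel='rbf', gamma=0.5, shrinking=False),
                         {'nu': np.linspace(0.05, 0.5, 10)}, cv=5)

# C is unbounded above, so the grid has to be logarithmic and open-ended
c_search = GridSearchCV(svm.SVC(kernel='rbf', gamma=0.5, shrinking=False),
                        {'C': np.logspace(-2, 3, 11)}, cv=5)

nu_search.fit(train_feats, train_labels.ravel())
c_search.fit(train_feats, train_labels.ravel())
print(nu_search.best_params_, c_search.best_params_)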

alexisrozhkov avatar Jun 27 '17 09:06 alexisrozhkov

I think that, due to numerical inaccuracy, both C-SVC and nu-SVC may be slightly affected by the order of the data. However, the resulting model (or test accuracy) should be similar.
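One way to check this directly, rather than through test accuracy alone, is to compare the decision values and support-vector counts of models trained on two shufflings of the same data. A minimal sketch, reusing nsvm, train, and test_feats from the snippet above (refit_stats is just an illustrative helper, not part of any API):

import numpy as np

def refit_stats(clf, train, test_feats):
    # refit the classifier on a freshly shuffled copy of the training data
    shuffled = train.copy()
    np.random.shuffle(shuffled)
    clf.fit(shuffled[:, :-1], shuffled[:, -1])
    # decision values on a fixed test set characterize the model itself
    return clf.decision_function(test_feats), clf.n_support_

dec0, sv0 = refit_stats(nsvm, train, test_feats)
dec1, sv1 = refit_stats(nsvm, train, test_feats)

# small differences suggest numerical noise; large ones suggest a genuine
# dependence on sample order
print(np.max(np.abs(dec0 - dec1)), sv0, sv1)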


cjlin1 avatar Jun 28 '17 09:06 cjlin1

In this case it looks like a bug: the accuracy variation is too high in the NU_SVC case.

alexisrozhkov avatar Jun 28 '17 16:06 alexisrozhkov

C-SVC and nu-SVC are equivalent. Are you using the corresponding parameters?
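For context, each nu-SVC solution (with positive margin) is known to correspond to a C-SVC solution with the same decision function for a suitable C, and nu lower-bounds the fraction of support vectors. A rough, illustrative way to look for an approximately corresponding C empirically, reusing train_feats/train_labels from the snippet above (the grid is arbitrary):

import numpy as np
from sklearn import svm

X, y = train_feats, train_labels.ravel()

nsvm = svm.NuSVC(kernel='rbf', gamma=0.5, nu=0.1, shrinking=False).fit(X, y)
print('nu-SVC support-vector fraction:', len(nsvm.support_) / float(len(y)))

# scan C and compare support-vector fractions; a C-SVC whose fraction
# matches the nu-SVC one is a candidate for a "corresponding" parameter
for C in np.logspace(-2, 3, 11):
    csvm = svm.SVC(kernel='rbf', gamma=0.5, C=C, shrinking=False).fit(X, y)
    print(C, len(csvm.support_) / float(len(y)))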


cjlin1 avatar Jun 28 '17 21:06 cjlin1