scikit-learn-intelex
SVC performance drop if probability=True
Describe the bug
daal4py SVC performance drops on the first run and becomes slower than sklearn if probability=True is passed in the params.
To Reproduce
Code to reproduce the behavior:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC as SKLSVC
from daal4py.sklearn.svm import SVC as D4PSVC
from timeit import default_timer as timer

params = {
    'probability': True,
    'C': 1.0,
    'kernel': 'rbf',
    'random_state': 42
}

x, y = make_classification(n_samples=4000, n_features=32, n_classes=2, random_state=42)
X_train, X_test, Y_train, Y_test = train_test_split(x, y, test_size=0.2, random_state=42)

for i in range(5):
    print(f'{i} run:')
    for estimator_class in [SKLSVC, D4PSVC]:
        t0 = timer()
        clsf = estimator_class(**params)
        clsf.fit(X_train, Y_train)
        t1 = timer()
        print(
            str(estimator_class).split("'")[1].split(".")[0], '-',
            round(t1 - t0, 6), 's', '-',
            clsf.score(X_test, Y_test), 'acc'
        )
Expected behavior
daal4py SVC is faster than the sklearn one.
Output
0 run:
sklearn - 1.721927 s - 0.94 acc
daal4py - 18.420483 s - 0.94375 acc
1 run:
sklearn - 2.12422 s - 0.94 acc
daal4py - 0.207263 s - 0.94375 acc
2 run:
sklearn - 2.122007 s - 0.94 acc
daal4py - 0.202505 s - 0.94375 acc
3 run:
sklearn - 2.090547 s - 0.94 acc
daal4py - 0.210052 s - 0.94375 acc
4 run:
sklearn - 2.111722 s - 0.94 acc
daal4py - 0.204351 s - 0.94375 acc
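The output above suggests the extra cost is concentrated in the very first fit. One way to confirm that it is one-time initialization overhead rather than steady-state cost is to add a throwaway warm-up fit before timing. This is a sketch, not part of the original report, and it uses stock scikit-learn's SVC as a stand-in (the report times daal4py.sklearn.svm.SVC the same way):

```python
from timeit import default_timer as timer

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC  # stand-in; the report uses daal4py.sklearn.svm.SVC

x, y = make_classification(n_samples=4000, n_features=32, n_classes=2, random_state=42)
X_train, X_test, Y_train, Y_test = train_test_split(x, y, test_size=0.2, random_state=42)

# Throwaway warm-up fit: absorbs any one-time initialization cost
SVC(probability=True, random_state=42).fit(X_train, Y_train)

# Timed "warm" fit: this should match the steady-state times of runs 1-4
t0 = timer()
clf = SVC(probability=True, random_state=42).fit(X_train, Y_train)
print(f'warm fit: {timer() - t0:.3f} s, acc: {clf.score(X_test, Y_test):.4f}')
```

If the warm fit is fast while the warm-up fit is slow, the regression is in first-call setup rather than in the solver itself.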
Environment:
- OS: Red Hat Enterprise Linux Server release 7.9
- SW: scikit-learn 0.24.1, Python 3.8.8
- scikit-learn-intelex version: 2021.2.2
Similar here: CV+PCA+SVC runs 3-4 times slower with intelex. I use it with a small amount of data (n<100); could that be a suboptimal scenario? Ubuntu, Python 3.8, intelex 2021.2.2.
@gorogm Thank you for reporting. We'll investigate it.
@gorogm n<100 is a really small size; how long do the SVMs take with scikit-learn and with scikit-learn-intelex?
Can you share a script or dataset, then we can figure it out in more detail?
Hello @PetrovKP, I've put together this minimal code to demonstrate it:

import numpy as np
from sklearnex import patch_sklearn, unpatch_sklearn
import sklearn.svm

X = np.random.random([100, 100])
y = np.random.randint(0, 2, [100])

def measure():
    for i in range(10):          # fake 10 times repeated
        for j in range(10):      # fake 10 times param-search
            for k in range(10):  # fake 10-fold CV
                g = sklearn.svm.SVC(probability=True, kernel='linear')
                g.fit(X, y)

patch_sklearn()
%time measure()
# Prints:
# Intel(R) Extension for Scikit-learn* enabled (https://github.com/intel/scikit-learn-intelex)
# CPU times: user 11 s, sys: 769 ms, total: 11.8 s
# Wall time: 11.8 s

unpatch_sklearn()
%time measure()
# Prints:
# CPU times: user 4.33 s, sys: 0 ns, total: 4.33 s
# Wall time: 4.32 s
So it's almost 3 times slower. With probability=False, the results are much closer.
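For context (background fact, not something stated in this thread): in scikit-learn, probability=True triggers Platt-scaling calibration, which internally fits extra models via cross-validation on top of the main SVM, so any per-fit overhead gets multiplied. If calibrated probabilities are not actually needed, decision_function with probability=False avoids that cost while still giving a usable ranking score. A minimal sketch:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.random((100, 100))
y = rng.integers(0, 2, 100)

# probability=False: fit trains only the single SVM, skipping the
# internal cross-validated calibration that probability=True performs.
fast = SVC(probability=False, kernel='linear').fit(X, y)

# decision_function still provides a confidence score; for binary
# classification, its sign determines the predicted class.
scores = fast.decision_function(X)
preds = fast.classes_[(scores > 0).astype(int)]
assert (preds == fast.predict(X)).all()
```

This does not explain the intelex-specific slowdown, but it can be a practical workaround in pipelines (like the CV+PCA+SVC case above) where only hard predictions or rankings are consumed downstream.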