SVC performance drop if probability=True

Open Alexsandruss opened this issue 3 years ago • 4 comments

Describe the bug daal4py SVC is much slower than sklearn SVC on the first run when probability=True is passed in params. Later runs are fast.

To Reproduce Code to reproduce the behavior:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC as SKLSVC
from daal4py.sklearn.svm import SVC as D4PSVC
from timeit import default_timer as timer


params = {
    'probability': True,
    'C': 1.0,
    'kernel': 'rbf',
    'random_state': 42
}

x, y = make_classification(n_samples=4000, n_features=32, n_classes=2, random_state=42)
X_train, X_test, Y_train, Y_test = train_test_split(x, y, test_size=0.2, random_state=42)

for i in range(5):
    print(f'{i} run:')
    for estimator_class in [SKLSVC, D4PSVC]:
        t0 = timer()
        clsf = estimator_class(**params)
        clsf.fit(X_train, Y_train)
        t1 = timer()
        print(
            str(estimator_class).split("'")[1].split(".")[0], '-',
            round(t1 - t0, 6), 's', '-',
            clsf.score(X_test, Y_test), 'acc'
        )

Expected behavior daal4py SVC is faster than sklearn SVC on every run, including the first.

Output

0 run:
sklearn - 1.721927 s - 0.94 acc
daal4py - 18.420483 s - 0.94375 acc
1 run:
sklearn - 2.12422 s - 0.94 acc
daal4py - 0.207263 s - 0.94375 acc
2 run:
sklearn - 2.122007 s - 0.94 acc
daal4py - 0.202505 s - 0.94375 acc
3 run:
sklearn - 2.090547 s - 0.94 acc
daal4py - 0.210052 s - 0.94375 acc
4 run:
sklearn - 2.111722 s - 0.94 acc
daal4py - 0.204351 s - 0.94375 acc
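The output above shows the slowdown only in run 0, so it helps to report the first fit separately from later fits. The following is a minimal sketch of such a harness, not the code used in this issue. It uses plain scikit-learn so that it runs without daal4py, a smaller dataset than the report, and a hypothetical helper named `time_fits`; any estimator class with the same constructor parameters could be passed in its place.

```python
from timeit import default_timer as timer

from sklearn.datasets import make_classification
from sklearn.svm import SVC


def time_fits(estimator_class, params, X, y, n_runs=3):
    """Return per-run fit times in seconds, first run listed first.

    Hypothetical helper for illustration; not part of any library.
    """
    times = []
    for _ in range(n_runs):
        t0 = timer()
        estimator_class(**params).fit(X, y)
        times.append(timer() - t0)
    return times


# Smaller than the dataset in the report, to keep the sketch quick.
X, y = make_classification(n_samples=500, n_features=32, n_classes=2, random_state=42)
params = {'probability': True, 'C': 1.0, 'kernel': 'rbf', 'random_state': 42}

times = time_fits(SVC, params, X, y)
first, rest = times[0], times[1:]
print(f'first fit: {first:.4f} s, mean of later fits: {sum(rest) / len(rest):.4f} s')
```

Reporting the two numbers side by side makes a one-time first-call cost visible instead of hiding it in an average.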

Environment:

  • OS: Red Hat Enterprise Linux Server release 7.9
  • SW: scikit-learn 0.24.1, Python 3.8.8
  • scikit-learn-intelex version: 2021.2.2

Alexsandruss avatar Apr 14 '21 13:04 Alexsandruss

Similar here: CV+PCA+SVC runs 3-4 times slower with intelex. I use it with a small amount of data (n<100); is that perhaps a suboptimal scenario? Ubuntu, Python 3.8, intelex 2021.2.2.

gorogm avatar May 27 '21 08:05 gorogm

@gorogm Thank you for reporting. We'll investigate it

owerbat avatar May 27 '21 09:05 owerbat

@gorogm n<100 is a really small data size. How long does SVM take with scikit-learn and with scikit-learn-intelex?

Can you share a script or dataset so that we can look into it in more detail?

PetrovKP avatar May 27 '21 12:05 PetrovKP

Hello @PetrovKP, I've put together this minimal code to demonstrate it:

import numpy as np
from sklearnex import patch_sklearn, unpatch_sklearn
import sklearn.svm  # import the submodule explicitly so that sklearn.svm.SVC resolves

X = np.random.random([100,100])
y = np.random.randint(0, 2, [100])

def measure():
    for i in range(10): # fake 10 times repeated
        for j in range(10): # fake 10 times param-search
            for k in range(10): # fake 10-fold CV
                g = sklearn.svm.SVC(probability=True, kernel='linear')
                g.fit(X, y)

patch_sklearn()
%time measure()
# Prints:
# Intel(R) Extension for Scikit-learn* enabled (https://github.com/intel/scikit-learn-intelex)
# CPU times: user 11 s, sys: 769 ms, total: 11.8 s
# Wall time: 11.8 s

unpatch_sklearn()
%time measure()
# Prints:
# CPU times: user 4.33 s, sys: 0 ns, total: 4.33 s
# Wall time: 4.32 s

So it's almost 3 times slower. With probability=False, results are much closer.
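For context on why the `probability` flag matters: the scikit-learn documentation states that `probability=True` makes `fit` run an internal 5-fold cross-validation to calibrate probabilities, so each fit does extra work. Whether the patched implementation takes the same path is not stated in this thread. The sketch below only measures the cost of the flag in plain scikit-learn, on data shaped like the example above; `fit_time` is a hypothetical helper, and the seed and repeat count are arbitrary choices.

```python
from timeit import default_timer as timer

import numpy as np
from sklearn.svm import SVC

# Same shape as the data in the example above; the seed is an arbitrary choice.
rng = np.random.default_rng(0)
X = rng.random((100, 100))
y = rng.integers(0, 2, 100)


def fit_time(probability, n_repeats=20):
    """Total seconds spent on n_repeats fits with the given probability flag.

    Hypothetical helper for illustration; not part of any library.
    """
    t0 = timer()
    for _ in range(n_repeats):
        SVC(probability=probability, kernel='linear').fit(X, y)
    return timer() - t0


timings = {flag: fit_time(flag) for flag in (False, True)}
for flag, seconds in timings.items():
    print(f'probability={flag}: {seconds:.3f} s')
```

Running the same loop once after `patch_sklearn()` and once after `unpatch_sklearn()` would show how much of the gap comes from the probability calibration step.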

gorogm avatar May 27 '21 15:05 gorogm