sklearn-porter
Prediction for ExtraTree model differs from sklearn (tested for C model)
I was trying to implement the predict_proba function for an Extra Tree model when I realized that the result returned by the transpiled version of the model differed from the one returned by sklearn.
My model contains 30 trees and 3 classes. Below are the classes predicted by sklearn alongside the probabilities for each estimator:
| | Proba Class 0 | Proba Class 1 | Proba Class 2 | Predicted class |
---|---|---|---|---|
Estimator 0 | 0.1765 | 0.0000 | 0.8235 | 2 |
Estimator 1 | 0.0000 | 0.0000 | 1.0000 | 2 |
Estimator 2 | 0.1667 | 0.0000 | 0.8333 | 2 |
Estimator 3 | 0.6923 | 0.0000 | 0.3077 | 0 |
Estimator 4 | 0.8125 | 0.0417 | 0.1458 | 0 |
Estimator 5 | 0.8374 | 0.0064 | 0.1562 | 0 |
Estimator 6 | 0.9727 | 0.0000 | 0.0273 | 0 |
Estimator 7 | 0.3429 | 0.0000 | 0.6571 | 2 |
Estimator 8 | 0.8391 | 0.0095 | 0.1514 | 0 |
Estimator 9 | 0.0000 | 0.0000 | 1.0000 | 2 |
Estimator 10 | 0.7266 | 0.0078 | 0.2656 | 0 |
Estimator 11 | 0.6220 | 0.0000 | 0.3780 | 0 |
Estimator 12 | 0.5000 | 0.0000 | 0.5000 | 0 |
Estimator 13 | 0.6117 | 0.0000 | 0.3883 | 0 |
Estimator 14 | 0.0000 | 0.0000 | 1.0000 | 2 |
Estimator 15 | 0.8687 | 0.0000 | 0.1313 | 0 |
Estimator 16 | 1.0000 | 0.0000 | 0.0000 | 0 |
Estimator 17 | 0.8468 | 0.0170 | 0.1362 | 0 |
Estimator 18 | 0.5595 | 0.0000 | 0.4405 | 0 |
Estimator 19 | 0.0714 | 0.0000 | 0.9286 | 2 |
Estimator 20 | 0.4600 | 0.0000 | 0.5400 | 2 |
Estimator 21 | 0.0000 | 0.0000 | 1.0000 | 2 |
Estimator 22 | 0.5217 | 0.0000 | 0.4783 | 0 |
Estimator 23 | 0.8322 | 0.0049 | 0.1629 | 0 |
Estimator 24 | 0.5000 | 0.0000 | 0.5000 | 0 |
Estimator 25 | 0.3333 | 0.0000 | 0.6667 | 2 |
Estimator 26 | 1.0000 | 0.0000 | 0.0000 | 0 |
Estimator 27 | 0.4545 | 0.0000 | 0.5455 | 2 |
Estimator 28 | 0.0000 | 0.0000 | 1.0000 | 2 |
Estimator 29 | 0.0000 | 0.0000 | 1.0000 | 2 |
MODEL | 0.4916 | 0.0029 | 0.5055 | 2 |
17 estimators predict class 0 and 13 predict class 2, BUT the model predicts class 2 because it has the highest averaged probability (0.5055 vs. 0.4916 for class 0).
Therefore it seems to me that the transpiled model should also make its decision on the predicted probabilities.
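To make the difference concrete, here is a minimal sketch in plain Python that applies both decision rules to the per-estimator probabilities from the table above. The function names `hard_vote` and `soft_vote` are mine and purely illustrative, and I am assuming the transpiled model takes a majority vote over the per-tree predictions; the sketch does not call sklearn or sklearn-porter.

```python
# Per-estimator class probabilities copied from the table above
# (30 trees, 3 classes; one row per estimator).
PROBAS = [
    [0.1765, 0.0000, 0.8235], [0.0000, 0.0000, 1.0000], [0.1667, 0.0000, 0.8333],
    [0.6923, 0.0000, 0.3077], [0.8125, 0.0417, 0.1458], [0.8374, 0.0064, 0.1562],
    [0.9727, 0.0000, 0.0273], [0.3429, 0.0000, 0.6571], [0.8391, 0.0095, 0.1514],
    [0.0000, 0.0000, 1.0000], [0.7266, 0.0078, 0.2656], [0.6220, 0.0000, 0.3780],
    [0.5000, 0.0000, 0.5000], [0.6117, 0.0000, 0.3883], [0.0000, 0.0000, 1.0000],
    [0.8687, 0.0000, 0.1313], [1.0000, 0.0000, 0.0000], [0.8468, 0.0170, 0.1362],
    [0.5595, 0.0000, 0.4405], [0.0714, 0.0000, 0.9286], [0.4600, 0.0000, 0.5400],
    [0.0000, 0.0000, 1.0000], [0.5217, 0.0000, 0.4783], [0.8322, 0.0049, 0.1629],
    [0.5000, 0.0000, 0.5000], [0.3333, 0.0000, 0.6667], [1.0000, 0.0000, 0.0000],
    [0.4545, 0.0000, 0.5455], [0.0000, 0.0000, 1.0000], [0.0000, 0.0000, 1.0000],
]


def argmax(values):
    """Index of the largest value; ties go to the lowest index."""
    return max(range(len(values)), key=lambda i: values[i])


def hard_vote(probas):
    """Majority vote: each estimator casts one vote for its own top class."""
    votes = [0] * len(probas[0])
    for row in probas:
        votes[argmax(row)] += 1
    return argmax(votes), votes


def soft_vote(probas):
    """Average the probabilities over all estimators, then take the top class."""
    n_classes = len(probas[0])
    mean = [sum(row[c] for row in probas) / len(probas) for c in range(n_classes)]
    return argmax(mean), mean


hard_class, votes = hard_vote(PROBAS)
soft_class, mean = soft_vote(PROBAS)

print("votes per class :", votes)
print("majority vote   :", hard_class)
print("mean probas     :", [round(p, 4) for p in mean])
print("averaged choice :", soft_class)
```

With these numbers the majority vote picks class 0 (17 votes against 13), while averaging reproduces the MODEL row of the table and picks class 2.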
What do you think?
Hello @LambertAn, thanks for your detailed report. Can you provide some data to reproduce the behaviour? And did you run the integrity check with integrity_score? What score did you get?
Thanks for getting back to me.
Below is code to build a 3-class extra tree classifier on random data.
```python
from sklearn_porter import Porter
from sklearn.ensemble import ExtraTreesClassifier
import numpy as np

# Build random dataset
prng = np.random.RandomState(123)
X = prng.rand(50, 10)
y = prng.randint(0, 3, 50)

# Fit model
model = ExtraTreesClassifier(n_estimators=3, max_depth=3, random_state=prng)
model.fit(X, y)

# Export the model as C source
porter = Porter(model, language='c')
output = porter.export(embed_data=True)
with open('extratree_randomdataset_original.c', 'w') as f_out:
    f_out.write(output)

# Integrity check of the ported model against sklearn
integrity = porter.integrity_score(X)
print(integrity)

# Show details for one point
test_point = X[2:3]
for i, estimator in enumerate(model.estimators_):
    print("{}: {} -> {}".format(i, estimator.predict_proba(test_point), estimator.predict(test_point)))
print(model.predict_proba(test_point))
print(model.predict(test_point))
print(test_point)
```
The integrity score on the training data is 0.86. Let's look at the result for one of the data points, where each estimator predicts a different class:

- Estimator 0 predicts class 0 with probabilities [0.45 0.20 0.35]
- Estimator 1 predicts class 2 with probabilities [0.17 0.08 0.75]
- Estimator 2 predicts class 1 with probabilities [0.24 0.52 0.24]

The model predicts class 2 with probabilities [0.29 0.27 0.45].
I attached the above Python code and 2 C files (the original model as generated by sklearn-porter and a modified version that calculates the probabilities for each estimator as well as the average for the model prediction).
For the above point, the original 'predict' method returns class 0 and the new model's 'predict_proba' method returns [0.29 0.27 0.45].
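The same point can be checked by hand with a few lines of plain Python. This is only a sketch: the tie-breaking rule (lowest class index wins) is my assumption about how the exported C code resolves a three-way tie. It is consistent with the class 0 result above, but I have not confirmed it against the generated source.

```python
from collections import Counter
from statistics import fmean

# Per-estimator probabilities for the test point, as reported above.
probas = [
    [0.45, 0.20, 0.35],  # estimator 0
    [0.17, 0.08, 0.75],  # estimator 1
    [0.24, 0.52, 0.24],  # estimator 2
]

# Each tree votes for the class with its own highest probability.
tree_votes = [row.index(max(row)) for row in probas]
vote_counts = Counter(tree_votes)

# Majority vote. Every class gets exactly one vote here, so the result
# depends entirely on tie-breaking. ASSUMPTION: ties go to the lowest
# class index.
majority = min(vote_counts, key=lambda c: (-vote_counts[c], c))

# Averaging the probabilities per class breaks the tie on actual evidence.
mean_proba = [fmean(column) for column in zip(*probas)]
averaged = mean_proba.index(max(mean_proba))

print("tree votes      :", tree_votes)
print("majority vote   :", majority)
print("mean probas     :", [round(p, 2) for p in mean_proba])
print("averaged choice :", averaged)
```

The three trees vote 0, 2 and 1, so a vote count alone cannot separate the classes, whereas the averaged probabilities clearly favour class 2.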
I hope it is enough to reproduce the problem.
Hello @LambertAn, we found a small bug and fixed it (release/0.7.0: Merge branch 'master' into release/0.7.0). Can you please reinstall the package and test it again?
```bash
pip uninstall -y sklearn-porter
pip install --no-cache-dir https://github.com/nok/sklearn-porter/zipball/master
```
Hi, I finally had some time to test but unfortunately this problem was not fixed. I used the python script above and had exactly the same results as before with an integrity score of 0.86.
I believe this is the same issue as https://github.com/nok/sklearn-porter/issues/52