sklearn-porter
Decision tree C code exported by porter has the wrong data type for the features array; it should be float
The C code exported by porter declares the feature values with the wrong data type (double), which reduces the prediction accuracy.
scikit-learn code:

def predict(self, X, check_input=True):
    """Predict class or regression value for X.

    For a classification model, the predicted class for each sample in X is
    returned. For a regression model, the predicted value based on X is
    returned.

    Parameters
    ----------
    X : array-like or sparse matrix of shape = [n_samples, n_features]
        The input samples. Internally, it will be converted to
        ``dtype=np.float32`` and if a sparse matrix is provided
        to a sparse ``csr_matrix``.

    check_input : boolean, (default=True)
        Allow to bypass several input checking.
        Don't use this parameter unless you know what you do.

    Returns
    -------
    y : array of shape = [n_samples] or [n_samples, n_outputs]
        The predicted classes, or the predict values.
    """
porter C code (main() template; {method_name} is replaced with the transpiled function name):

int main(int argc, const char * argv[]) {
    /* Features: */
    double features[argc-1];
    int i;
    for (i = 1; i < argc; i++) {
        features[i-1] = atof(argv[i]);
    }
    /* Prediction: */
    printf("%d", {method_name}(features, 0));
    return 0;
}
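As a side note, here is a minimal, self-contained sketch of the mismatch (the threshold and the feature string are made-up illustrative values, not taken from a real model): scikit-learn casts X to float32 before comparing it against a split threshold, while the transpiled main() keeps the full double produced by atof(), so a value close to a threshold can take a different branch.

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    /* A split threshold, here written as 2.45 at float32 precision. */
    double threshold = (double)2.45f;

    /* The same textual feature value, parsed the two different ways: */
    double as_double = atof("2.45000007");          /* porter: parse straight into double */
    float  as_float  = strtof("2.45000007", NULL);  /* scikit-learn: X is cast to float32 */

    printf("double: %d\n", as_double <= threshold);          /* prints 0: goes right */
    printf("float:  %d\n", (double)as_float <= threshold);   /* prints 1: goes left  */
    return 0;
}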
Can you please provide some data and code for comparison?
(I guess there is a bigger difference between the internal and the textual representation of the values in Python.)
OK, I will provide a detailed example and data tomorrow.
The attached zip files contain:
- C program trained on 10000 records, accepting the features as float
- C program trained on 10000 records, accepting the features as double
- Shell script used to count the matched prediction records for the binaries of the two programs above
- Test data set file
- Expected prediction data file (porter_attachments.zip)
- CSV file used for training; the first column is the target class and the remaining columns are the feature values (train_10000.zip)
Test script output at my end:

./test_prediction.sh ./train_10000 ./train_10000_target ./porter_train_10000_double
test data file - test_data/train_10000
expected prediction data file - test_data/train_10000_target
testing output binary by feeding training data .......
Total records - 10000
Matched prediction records - 9878

./test_prediction.sh ./train_10000 ./train_10000_target ./porter_train_10000_float
test data file - test_data/train_10000
expected prediction data file - test_data/train_10000_target
testing output binary by feeding training data .......
Total records - 10000
Matched prediction records - 9992
Okay, thanks. Can you please validate the data type of your training data?
print(type(X[0][0]))  # <type 'numpy.float32'> or <type 'numpy.float64'>
For load_digits it's numpy.float64, which is double in C. The integrity check finished without mismatches. So I changed the data to floats with X.astype(np.float32) and the integrity check finished again without errors.
Nevertheless, it depends on the data. In general I see the problem of floating-point precision between data types and programming languages. It could make sense to add the possibility to change the data type of the features in the transpiled output by using a new argument, e.g. temp_dtype='float'. Further, atof() converts a string to double in C; if you want to use floats, you should use strtof() instead, which converts a string to float.
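As a rough sketch (not the current porter output, just how the template above could look with that change; {method_name} again stands for the transpiled function name):

int main(int argc, const char * argv[]) {
    /* Features, parsed as float to match scikit-learn's internal float32 cast: */
    float features[argc-1];
    int i;
    for (i = 1; i < argc; i++) {
        features[i-1] = strtof(argv[i], NULL);
    }
    /* Prediction: */
    printf("%d", {method_name}(features, 0));
    return 0;
}

The transpiled prediction function would of course also have to accept float features[] for this to compile.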
Can you test it?