onnxmltools
SparkML random forest classifier test not working.
I was trying to run the test against my local Spark installation, but the code does not work. I've pasted the exact code I ran below; it breaks at the last line, compare_results(expected, output, decimal=5). Almost all of the code below is copy-pasted from the actual test.
import sys
import inspect
import unittest
import os
from distutils.version import StrictVersion
import onnx
import pandas
import numpy
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.linalg import VectorUDT, SparseVector
from onnxmltools import convert_sparkml
from onnxmltools.convert.common.data_types import StringTensorType, FloatTensorType
from tests.sparkml.sparkml_test_utils import save_data_models, run_onnx_model, compare_results
from tests.sparkml import SparkMlTestCase
from pyspark.ml.feature import StringIndexer, VectorIndexer
from pyspark import SparkContext
from pyspark.sql import SparkSession
sc = SparkContext()
spark = SparkSession(sc)
original_data = spark.read.format("libsvm").load("/Users/sanashar/sample.txt")
feature_count = 5
spark.udf.register("truncateFeatures",
                   lambda x: SparseVector(feature_count, range(0, feature_count), x.toArray()[125:130]),
                   VectorUDT())
data = original_data.selectExpr("cast(label as string) as label", "truncateFeatures(features) as features")
label_indexer = StringIndexer(inputCol="label", outputCol="indexedLabel")
feature_indexer = VectorIndexer(inputCol="features", outputCol="indexedFeatures",
                                maxCategories=10, handleInvalid='keep')
rf = RandomForestClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures", numTrees=10)
pipeline = Pipeline(stages=[label_indexer, feature_indexer, rf])
model = pipeline.fit(data)
model_onnx = convert_sparkml(model, 'Sparkml RandomForest Classifier', [
    ('label', StringTensorType([1, 1])),
    ('features', FloatTensorType([1, feature_count]))
], spark_session=spark)
predicted = model.transform(data)
data_np = {
    'label': data.toPandas().label.values,
    'features': data.toPandas().features.apply(lambda x: pandas.Series(x.toArray())).values.astype(numpy.float32)
}
expected = [
    predicted.toPandas().indexedLabel.values.astype(numpy.int64),
    predicted.toPandas().prediction.values.astype(numpy.float32),
    predicted.toPandas().probability.apply(lambda x: pandas.Series(x.toArray())).values.astype(numpy.float32)
]
paths = save_data_models(data_np, expected, model, model_onnx,
                         basename="SparkmlRandomForestClassifier")
onnx_model_path = paths[3]
output, output_shapes = run_onnx_model(['indexedLabel', 'prediction', 'probability'], data_np, onnx_model_path)
compare_results(expected, output, decimal=5)
Since this was not working, I wrote a one-line check to compare the predictions myself, output[1] == expected[1], which showed that the expected values and the outputs obtained through onnxruntime don't match. Also, my kernel sometimes dies at the run_onnx_model call, which is odd too.
I'm not sure what's going on here and any help would be appreciated.
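For anyone trying to narrow this down, a tolerance-based check may be more informative than output[1] == expected[1], because exact equality on float32 values can report mismatches that are only rounding noise. The following is a minimal sketch, not part of the test utilities: mismatch_indices is a hypothetical helper, and the tolerance rule (absolute difference below 1.5 * 10**-decimal) is borrowed from numpy.testing.assert_almost_equal on the assumption that decimal=5 is meant in that sense.

```python
def mismatch_indices(expected, actual, decimal=5):
    """Return (index, expected, actual) for each pair that differs beyond the tolerance.

    Hypothetical helper. The tolerance follows the rule documented for
    numpy.testing.assert_almost_equal: abs(expected - actual) < 1.5 * 10**-decimal.
    Works on flat sequences such as lists or 1-D arrays.
    """
    expected = list(expected)
    actual = list(actual)
    if len(expected) != len(actual):
        raise ValueError(
            "length mismatch: %d expected vs %d actual" % (len(expected), len(actual))
        )
    tolerance = 1.5 * 10 ** (-decimal)
    return [
        (i, float(e), float(a))
        for i, (e, a) in enumerate(zip(expected, actual))
        if abs(float(e) - float(a)) >= tolerance
    ]


# Example with made-up values: only the row at index 2 really differs.
print(mismatch_indices([0.0, 1.0, 1.0], [0.0, 1.000001, 0.0]))
```

With the code above, something like mismatch_indices(expected[1], output[1].ravel()) would list the rows where the Spark prediction and the onnxruntime prediction disagree, assuming output[1] is a numpy array that may carry an extra dimension.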
I can't replicate the issue. Do you get the same error when you run the unit test itself? Which versions of pyspark, onnxruntime, onnx, and onnxmltools are you using?
I'm using Spark 2.4.3, onnxruntime 0.5.0, onnxmltools 1.5.0, and onnx 1.5.0. I didn't run the actual test; I just copied its code to my local machine and downloaded the required file "sample.txt".
I'm also using Python 3.7.3
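The version details asked for above can be gathered in one go with a short script. This is only a sketch: report_versions is a hypothetical helper, and it assumes each package exposes a __version__ attribute, falling back to "unknown" when it does not and to "not installed" when the import fails.

```python
import importlib
import platform


def report_versions(module_names):
    """Map each module name to its __version__, plus the running Python version."""
    versions = {"python": platform.python_version()}
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            # Package is missing from this environment.
            versions[name] = "not installed"
            continue
        versions[name] = getattr(module, "__version__", "unknown")
    return versions


if __name__ == "__main__":
    packages = ["pyspark", "onnxruntime", "onnx", "onnxmltools"]
    for name, version in report_versions(packages).items():
        print(f"{name}: {version}")
```

Running it in the same kernel that executes the failing code makes sure the reported versions are the ones actually in use.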