Scikit-learn model converted to ONNX results in different output shapes between Python and Java environments
Describe the bug
I was trying to train a scikit-learn model in Python, export it to ONNX and then use the model for prediction in a Java environment. The scikit-learn model was converted to ONNX using the skl2onnx Python package and loaded using ai.onnxruntime in Java.
However, the output was not complete when making predictions in Java. The predicted probabilities did not contain the entire probability array but only the first index of the array. When testing the ONNX model output in Python, the probability array was complete.
This issue persisted with different models (SVM, Logistic Regression, MLP) as well as different scikit-learn objects (a single classifier, a classifier as part of a Pipeline).
System information
- macOS 11.6.6
- JDK version: 11.0
- Python version: 3.9.5
- ONNX Runtime installed from (source or binary): pip
- ONNX Runtime version: 1.10.0
- ONNX version: 1.11.0
- skl2onnx version: 1.11.0
To Reproduce
- Export a scikit-learn model to ONNX using skl2onnx in Python
- Load ONNX model in Java using ai.onnxruntime
- Make a prediction
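On the Python side, those steps might look like the following. This is a hedged sketch, not the reporter's actual code: the iris dataset and a bare `LogisticRegression` stand in for the unspecified model and data, and the input name `"X"` is skl2onnx's default for array inputs. The ONNX round-trip is guarded so the sketch still runs where skl2onnx/onnxruntime are absent.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Train a scikit-learn classifier (stand-in for the reporter's model).
X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000).fit(X, y)
proba = clf.predict_proba(X[:5])  # Python-side ground truth: shape (5, 3)

try:
    from skl2onnx import to_onnx
    import onnxruntime as ort

    # Export to ONNX; 'zipmap': False keeps probabilities as a plain tensor.
    onx = to_onnx(clf, X.astype(np.float32), target_opset=12,
                  options={type(clf): {"zipmap": False}})

    # The report does inference in Java; onnxruntime's Python API stands in
    # here to show the expected outputs: labels plus the full probabilities.
    sess = ort.InferenceSession(onx.SerializeToString())
    labels, probabilities = sess.run(None, {"X": X[:5].astype(np.float32)})
    print(probabilities.shape)  # expected to match proba.shape, i.e. (5, 3)
except ImportError:
    print("skl2onnx/onnxruntime not installed; skipping the ONNX round-trip")
```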
Expected behavior
- The expected output is a list of length 2
- The first index is the predicted class
- The second index should be the array of predicted probabilities for every class in the target variable. However, in Java, only the first index of the probability array is returned
How are you accessing the outputs in Java? And can you supply an example model?
I checked one of the test models (a logistic regression exported from scikit-learn), and I got the expected output of a tensor containing the predicted labels, and a sequence of maps containing the labels and probabilities for each example. There might be some oddities in the handling of the zipmap, so how was the model exported from scikit-learn?
The model was exported using the to_onnx method of skl2onnx, with "zipmap": False specified in the options. When I do inference with the model in Python, I do get the full probability array.
On the Java side, I am mapping the probabilities output to an OnnxSequence. When I look at the info of the sequence, length=1. If I print it out, I get an array with only one value.
It looks like:

```java
OrtSession.Result pred = session.run(onnxInputMap);
OnnxSequence probabilities = (OnnxSequence) pred.get(1);
System.out.println(probabilities);
System.out.println(probabilities.getValue());
```
Ok, I'll look at it. The spec for the behaviour of the ONNX sequence seems underspecified. Currently the Java code is written to pull out the first element of each tensor in the sequence, because in the test examples I had, all sequences contained single-element tensors. If a sequence can in fact contain non-scalar tensors, then I'll need to refactor the logic in Java and in the native bridge.
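The failure mode described here can be illustrated with plain NumPy (hypothetical values, not the reporter's data): if each sequence element is a non-scalar probability tensor, taking only the first element of each tensor truncates the output.

```python
import numpy as np

# A sequence of probability tensors, as an ONNX sequence output might hold.
seq = [np.array([0.7, 0.2, 0.1]), np.array([0.1, 0.8, 0.1])]

# Behaviour described above: pull out only the first element of each tensor.
# Correct when every tensor holds a single element, lossy otherwise.
first_only = [float(t[0]) for t in seq]   # [0.7, 0.1] -- truncated

# Keeping each tensor whole preserves the full probability arrays.
whole = [t.tolist() for t in seq]
print(first_only)
print(whole)
```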
I exported a logistic regression from scikit-learn using

```python
onx = to_onnx(lr, x.astype(numpy.float32), target_opset=12,
              options={type(lr): {'zipmap': False}})
```

and when inspecting that model in Java I see that the output is a tensor:
```
jshell> session.getOutputInfo()
$11 ==> {
  label=NodeInfo(name=label,info=TensorInfo(javaType=INT64,onnxType=ONNX_TENSOR_ELEMENT_DATA_TYPE_INT64,shape=[-1])),
  probabilities=NodeInfo(name=probabilities,info=TensorInfo(javaType=FLOAT,onnxType=ONNX_TENSOR_ELEMENT_DATA_TYPE_FLOAT,shape=[-1, 3]))
}
```
and those probabilities are all returned as outputs.
Can you provide more information so I can replicate this? Either an ONNX model or the exact series of commands you use to generate it?
I tried to share the model here but that file type does not seem to be supported. If there is an alternative place I can share it, please let me know.
However, the model is structured as follows:
- the classifier is a logistic regression
- it is wrapped in a sklearn.multioutput.MultiOutputClassifier object
- it is then wrapped in a sklearn.pipeline.Pipeline object (the first step was a vectorizer)
When I examined the output of the model in Java, I got the following:
```
{
  label=NodeInfo(name=label,info=TensorInfo(javaType=INT64,onnxType=ONNX_TENSOR_ELEMENT_DATA_TYPE_INT64,shape=[-1, 1])),
  probabilities=NodeInfo(name=probabilities,info=SequenceInfo(length=UNKNOWN,type=FLOAT))
}
```
However, when I tried exporting only a logistic regression, I got a similar output type as you for the probabilities.
I suspect the difference comes from the MultiOutputClassifier and/or the Pipeline objects, which lead to the probabilities being exposed as an OnnxSequence rather than an OnnxTensor.
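One likely source of the sequence typing: MultiOutputClassifier.predict_proba returns a Python list of per-target probability arrays rather than a single array, which maps naturally onto an ONNX sequence of tensors. A sketch under assumptions (CountVectorizer and toy data stand in for the unspecified vectorizer and dataset):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

# Toy multi-output setup: two binary target columns.
X = ["good movie", "bad film", "great plot", "awful acting"]
Y = np.array([[1, 0], [0, 1], [1, 0], [0, 1]])

pipe = Pipeline([
    ("vec", CountVectorizer()),
    ("clf", MultiOutputClassifier(LogisticRegression())),
])
pipe.fit(X, Y)

# Unlike a plain classifier, predict_proba here returns a *list* of arrays,
# one (n_samples, n_classes) array per target column.
proba = pipe.predict_proba(X)
print(type(proba).__name__)  # list
print(len(proba))            # 2, one entry per target column
print(proba[0].shape)        # (4, 2)
```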
You can email the model to me at [email protected], but I'm out on vacation next week and won't look at it until the week after.
Thanks! I sent you an email.
Could you test out this branch: https://github.com/Craigacp/onnxruntime/tree/java-sequence-fix. I've modified OnnxSequence.getValue() so it now returns either List&lt;OnnxTensor&gt; or List&lt;Map&lt;Object,Object&gt;&gt; depending on the sequence element type. It passes the ported C# test for operating on sequences of tensors, but I'm not entirely happy with the semantics: I think it should return either pure Java-side values or ONNX values, rather than a mixture. I'm leaning towards making it return List&lt;OnnxTensor&gt; and List&lt;OnnxMap&gt;, because forcing the creation of multidimensional Java arrays ruins performance, but that's a bit more work than I have time for at the moment.
I found a little more time and moved OnnxSequence over so that getValue now returns List&lt;? extends OnnxValue&gt;, which means both element types supported in a sequence now behave the same way.