Output of Label Column when applying ONNX model is not as expected

Open antoniovs1029 opened this issue 4 years ago • 1 comments

When an ONNX model for a classifier is created with NimbusML and then applied, either with OnnxRunner (i.e. OnnxTransformer from ML.NET) or directly through the Python API of ONNX Runtime (ORT), the Label column (i.e. the column that was used as Label for the classifier) contains unexpected values.

The behavior differs somewhat depending on whether the input DataFrame's Label column is category, object (string) or float (as I show in my repro below, but I guess similar problems arise for other types). There are two main issues:

Issue 1. When running with ORT, the Label column output by the ONNX model contains 'keys' and not 'values', i.e. we get integers starting from 0 instead of whatever original values there were in Label. This happens regardless of the input Label column type.

Issue 2. When running with OnnxRunner, the Label column has weird values. If the input Label column was object (string), the value in that column is "4294967295" for all rows; if the input was category or float, the value is "0".

Repro

NOTE: the data_frame_tool module used is the one currently in the aml branch (link)

import os
import tempfile
from data_frame_tool import DataFrameTool as DFT
from nimbusml.datasets import get_dataset
from nimbusml.linear_model import FastLinearClassifier
from nimbusml.preprocessing import OnnxRunner
from nimbusml.preprocessing import FromKey, ToKey
from nimbusml import Pipeline

def get_tmp_file(suffix=None):
    fd, file_name = tempfile.mkstemp(suffix=suffix)
    fl = os.fdopen(fd, 'w')
    fl.close()
    return file_name

# Change the label column to see different behaviors:
LABEL_COLUMN_NAME = "Species" # Type: object (string)
#LABEL_COLUMN_NAME = "Setosa" # Type: float
#LABEL_COLUMN_NAME = "Label" # Type: category

iris_df = get_dataset("iris").as_df()
print("\n\nORIGINAL DATASET - using", LABEL_COLUMN_NAME, " as Label column")
print(iris_df)
print(iris_df.dtypes)

predictor = FastLinearClassifier(feature=["Sepal_Length", "Sepal_Width", "Petal_Length", "Petal_Width"], label=LABEL_COLUMN_NAME)
predictor.fit(iris_df)

print("\n\nML.NET RESULT")
original_result = predictor.predict(iris_df) # Notice this outputs only "PredictedLabel" so the user can't get the Label column after applying the predictor. QUESTION: Is there a way for the user to get that column after the predictor?
print(predictor.model_)
print(original_result)
print(original_result.dtypes)

# onnxpath = get_tmp_file()
onnxpath = get_tmp_file()
print()
print("Onnx model path:", onnxpath)
predictor.export_to_onnx(onnxpath, 'com.microsoft.ml')

print("\n\nORT RESULT")
df_tool = DFT(onnxpath)
result_ort = df_tool.execute(iris_df, [])
print(result_ort)
print("\nColumn:", LABEL_COLUMN_NAME, " - ORT RESULT") # Issue 1: It prints the "keys", instead of values for the Label column
print(result_ort[LABEL_COLUMN_NAME + ".output"])

print("\n\nONNX RUNNER RESULT")
onnxrunner = OnnxRunner(model_file=onnxpath)
result_onnx = onnxrunner.fit_transform(iris_df)
print(result_onnx)
print(result_onnx.dtypes)
print("\nColumn:", LABEL_COLUMN_NAME, " - ONNX RUNNER RESULT") # Issue 2: It prints "4294967295" when label column is "Species" (string), "0" when label column is "Label" (category) and "Setosa" (float), for every row
print(result_onnx[LABEL_COLUMN_NAME])

Output (for LABEL_COLUMN_NAME="Species")

ORIGINAL DATASET - using Species  as Label column
     Sepal_Length  Sepal_Width  Petal_Length  Petal_Width Label    Species  Setosa
0             5.1          3.5           1.4          0.2     0     setosa     1.0
1             4.9          3.0           1.4          0.2     0     setosa     1.0
2             4.7          3.2           1.3          0.2     0     setosa     1.0
3             4.6          3.1           1.5          0.2     0     setosa     1.0
4             5.0          3.6           1.4          0.2     0     setosa     1.0
..            ...          ...           ...          ...   ...        ...     ...
145           6.7          3.0           5.2          2.3     2  virginica     0.0
146           6.3          2.5           5.0          1.9     2  virginica     0.0
147           6.5          3.0           5.2          2.0     2  virginica     0.0
148           6.2          3.4           5.4          2.3     2  virginica     0.0
149           5.9          3.0           5.1          1.8     2  virginica     0.0

[150 rows x 7 columns]
Sepal_Length     float64
Sepal_Width      float64
Petal_Length     float64
Petal_Width      float64
Label           category
Species           object
Setosa           float64
dtype: object
Automatically adding a MinMax normalization transform, use 'norm=Warn' or 'norm=No' to turn this behavior off.
Using 6 threads to train.
Automatically choosing a check frequency of 6.
Auto-tuning parameters: maxIterations = 9996.
Auto-tuning parameters: L2 = 2.667734E-05.
Auto-tuning parameters: L1Threshold (L1/L2) = 0.
Using best model from iteration 948.
Not training a calibrator because it is not needed.
Elapsed time: 00:00:00.9079426


ML.NET RESULT
C:\Users\anvelazq\AppData\Local\Temp\tmp7b539j8w.model.bin
0         setosa
1         setosa
2         setosa
3         setosa
4         setosa
         ...
145    virginica
146    virginica
147    virginica
148    virginica
149    virginica
Name: PredictedLabel, Length: 150, dtype: object
object

Onnx model path: C:\Users\anvelazq\Desktop\is23repros\model-labelissue.onnx


ORT RESULT
     Sepal_Length.output  Sepal_Width.output  Petal_Length.output  ...  Score.output.0 Score.output.1  Score.output.2
0                    5.1                 3.5                  1.4  ...    9.979612e-01       0.002039    7.896303e-15
1                    4.9                 3.0                  1.4  ...    9.935742e-01       0.006426    1.243418e-13
2                    4.7                 3.2                  1.3  ...    9.969639e-01       0.003036    2.946764e-14
3                    4.6                 3.1                  1.5  ...    9.950643e-01       0.004936    1.473649e-13
4                    5.0                 3.6                  1.4  ...    9.984953e-01       0.001505    4.957718e-15
..                   ...                 ...                  ...  ...             ...            ...             ...
145                  6.7                 3.0                  5.2  ...    6.576003e-09       0.002802    9.971976e-01
146                  6.3                 2.5                  5.0  ...    3.143095e-07       0.031589    9.684103e-01
147                  6.5                 3.0                  5.2  ...    4.240965e-07       0.031176    9.688237e-01
148                  6.2                 3.4                  5.4  ...    1.435240e-08       0.002293    9.977069e-01
149                  5.9                 3.0                  5.1  ...    7.885213e-06       0.121532    8.784599e-01

[150 rows x 19 columns]

Column: Species  - ORT RESULT
0      1
1      1
2      1
3      1
4      1
      ..
145    3
146    3
147    3
148    3
149    3
Name: Species.output, Length: 150, dtype: uint32


ONNX RUNNER RESULT
     Sepal_Length  Sepal_Width  Petal_Length  ...  Score.setosa Score.versicolor  Score.virginica
0             5.1          3.5           1.4  ...  9.979612e-01         0.002039     7.896303e-15
1             4.9          3.0           1.4  ...  9.935742e-01         0.006426     1.243418e-13
2             4.7          3.2           1.3  ...  9.969639e-01         0.003036     2.946764e-14
3             4.6          3.1           1.5  ...  9.950643e-01         0.004936     1.473649e-13
4             5.0          3.6           1.4  ...  9.984953e-01         0.001505     4.957718e-15
..            ...          ...           ...  ...           ...              ...              ...
145           6.7          3.0           5.2  ...  6.576003e-09         0.002802     9.971976e-01
146           6.3          2.5           5.0  ...  3.143095e-07         0.031589     9.684103e-01
147           6.5          3.0           5.2  ...  4.240965e-07         0.031176     9.688237e-01
148           6.2          3.4           5.4  ...  1.435240e-08         0.002293     9.977069e-01
149           5.9          3.0           5.1  ...  7.885213e-06         0.121532     8.784599e-01

[150 rows x 19 columns]
Sepal_Length                        float64
Sepal_Width                         float64
Petal_Length                        float64
Petal_Width                         float64
Label                                object
Species                              uint32
Setosa                              float64
311418708f7545c0a2fd7f3db667a0cd    float32
5ab7f7a1e38348f4b66ed5e3a9c2416e    float32
776cb47f18c24a52a72e93f759808599    float32
17f5b772493b497fa3dfca2abffc6049    float32
Features.Sepal_Length               float32
Features.Sepal_Width                float32
Features.Petal_Length               float32
Features.Petal_Width                float32
PredictedLabel                       object
Score.setosa                        float32
Score.versicolor                    float32
Score.virginica                     float32
dtype: object

Column: Species  - ONNX RUNNER RESULT
0      4294967295
1      4294967295
2      4294967295
3      4294967295
4      4294967295
          ...
145    4294967295
146    4294967295
147    4294967295
148    4294967295
149    4294967295
Name: Species, Length: 150, dtype: uint32

antoniovs1029, Mar 02 '20 22:03

So far it seems to me that both issues are related to the fact that NimbusML adds a "Transforms.OptionalColumnCreator" to the input Label column, and also to how NimbusML works with KeyDataViewTypes.

For Issue 1

When input Label column is not category

When it is not of type category, NimbusML adds a "Transforms.LabelColumnKeyBooleanConverter" to the beginning of the pipeline, which in turn adds a ValueToKeyTransformer that maps values to keys. NimbusML never adds a KeyToValueTransformer, and that's why we only get keys. https://github.com/dotnet/machinelearning/pull/4841 considered only 2 cases (described there) in which ML.NET would add the KeyToValueTransformers; this case isn't one of them, so perhaps support for this case should be added as well.
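The effect of mapping values to keys without ever mapping them back can be illustrated in plain Python, independent of NimbusML. This is only a minimal sketch: `value_to_key` and `key_to_value` are hypothetical helpers (not NimbusML or ML.NET APIs), and the 1-based, first-appearance key assignment is an assumption chosen to match the ORT output shown in the repro for "Species".

```python
def value_to_key(values):
    """Assign each distinct value a 1-based integer key, in order of first appearance.

    Assumption: this mimics the keys seen in the ORT output of the repro;
    it is not the actual ValueToKeyTransformer implementation.
    """
    vocab = {}
    keys = []
    for value in values:
        if value not in vocab:
            vocab[value] = len(vocab) + 1
        keys.append(vocab[value])
    return keys, vocab


def key_to_value(keys, vocab):
    """Invert the mapping; this is the step that is missing from the exported pipeline."""
    inverse = {key: value for value, key in vocab.items()}
    return [inverse[key] for key in keys]


labels = ["setosa", "setosa", "versicolor", "virginica"]
keys, vocab = value_to_key(labels)
restored = key_to_value(keys, vocab)

print(keys)      # what the Label column looks like with only the value-to-key step
print(restored)  # what it would look like if the inverse step were also applied
```

Without the second step, only the integer keys reach the output, which is what Issue 1 describes.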

When input Label column is category

When the input Label column is of type category, NimbusML converts it automatically to KeyDataViewType, without actually adding a ValueToKeyTransformer to the pipeline. https://github.com/dotnet/machinelearning/pull/4841 considered 2 cases, one of which is "pass through" categorical columns, which addresses this issue. So here, a Label column of type category that is untouched by the inference pipeline is supposed to be a "pass through" column. The problem is that, since the OptionalColumn transform is added to the pipeline, the Label column stops being pass-through, and the KeyToValueTransformer isn't added to the pipeline. So changes to this also need to be taken into account.

For Issue 2

The OptionalColumnTransform is saved to ONNX as an initializer, which provides default values as the input. So this might be why the same value is given for every row: 0 for float and categorical (which, in this case, was int behind the scenes), and "4294967295" for strings (because it is initialized with some string, which the labelEncoder then doesn't find, so it outputs int64 -1, which is then cast to uint32 as 4294967295).
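The arithmetic behind "4294967295" can be checked with the standard library alone. This is a sketch under stated assumptions: the vocabulary, the `encode` helper and the empty-string placeholder are hypothetical stand-ins for the labelEncoder and for the default value written by the initializer; only the -1 to uint32 conversion is taken from the explanation above.

```python
import ctypes

# Hypothetical vocabulary standing in for the one stored in the labelEncoder.
vocab = {"setosa": 0, "versicolor": 1, "virginica": 2}


def encode(label, default=-1):
    """Return the integer for a known label, or the default (-1) for an unknown one."""
    return vocab.get(label, default)


# Assumption: the initializer supplies some string that is not in the vocabulary;
# an empty string is used here purely for illustration.
placeholder = ""
encoded = encode(placeholder)

# Reinterpreting -1 as an unsigned 32-bit integer wraps around to 2**32 - 1.
as_uint32 = ctypes.c_uint32(encoded).value

print(encoded, as_uint32)
```

Since every row receives the same default input, every row ends up with the same wrapped value, which is consistent with the OnnxRunner output for "Species".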

Why this doesn't work in OnnxRunner but does work in ORT is still unclear to me. It might or might not be an issue in OnnxTransformer, which perhaps isn't handling initializer nodes correctly.

antoniovs1029, Mar 02 '20 22:03