Mismatch in output of onnx exported CharTokenizer model
The onnx export test for CharTokenizer is failing in the current tests, so it has been disabled (link). The outputs coming from ML.NET, OnnxRunner, and ORT on that test are different.
Here is a repro script and its output. Notice the differences in both values and dtypes between the outputs.
NOTE: The DataFrameTool is the one found here in the repository.
Repro
import pandas as pd
import tempfile
from nimbusml.datasets import get_dataset
from nimbusml.preprocessing.text import CharTokenizer
from nimbusml.preprocessing import OnnxRunner
from data_frame_tool import DataFrameTool as DFT
file_path = get_dataset("wiki_detox_train").as_filepath()
dataset = pd.read_csv(file_path, sep='\t')
dataset = dataset.head(10)
estimator = CharTokenizer(columns={'SentimentText_Transform': 'SentimentText'})
estimator.fit(dataset)
print("\n\nML.NET RESULT")
result_expected = estimator.transform(dataset)
print(estimator.model_)
print(result_expected)
print(result_expected.dtypes)
print("\n\nORT RESULT")
onnx_path = "C:\\Users\\anvelazq\Desktop\\is29chartokenizer\\chartokenizer.onnx"
estimator.export_to_onnx(onnx_path, 'com.microsoft.ml')
onnxrunner = OnnxRunner(model_file=onnx_path)
result_onnx = onnxrunner.fit_transform(dataset)
print(result_onnx)
print(result_onnx.dtypes)
print("\n\nONNX RUNNER RESULT")
df_tool = DFT(onnx_path)
result_ort = df_tool.execute(dataset, [])
print(result_ort)
print(result_ort.dtypes)
Output
ML.NET RESULT
C:\Users\anvelazq\AppData\Local\Temp\tmp4dd2p6jl.model.bin
Sentiment SentimentText SentimentText_Transform.000 ... SentimentText_Transform.419 SentimentText_Transform.420 SentimentText_Transform.421
0 1 ==RUDE== Dude, you are rude upload that carl... 1.0 ... NaN NaN NaN
1 1 == OK! == IM GOING TO VANDALIZE WILD ONES W... 1.0 ... NaN NaN NaN
2 1 Stop trolling, zapatancas, calling me a lia... 1.0 ... NaN NaN NaN
3 1 ==You're cool== You seem like a really cool... 1.0 ... NaN NaN NaN
4 1 ::::: Why are you threatening me? I'm not bei... 1.0 ... NaN NaN NaN
5 1 == hey waz up? == hey ummm... the fif four ... 1.0 ... NaN NaN NaN
6 0 ::::::::::I'm not sure either. I think it has... 1.0 ... NaN NaN NaN
7 0 *::Your POV and propaganda pushing is dully n... 1.0 ... 45.0 31.0 2.0
8 0 == File:Hildebrandt-Greg and Tim.jpg listed ... 1.0 ... NaN NaN NaN
9 0 ::::::::This is a gross exaggeration. Nobody... 1.0 ... NaN NaN NaN
[10 rows x 424 columns]
Sentiment int64
SentimentText object
SentimentText_Transform.000 float64
SentimentText_Transform.001 float64
SentimentText_Transform.002 float64
...
SentimentText_Transform.417 float64
SentimentText_Transform.418 float64
SentimentText_Transform.419 float64
SentimentText_Transform.420 float64
SentimentText_Transform.421 float64
Length: 424, dtype: object
ONNX RUNNER RESULT
Sentiment SentimentText SentimentText_Transform.000 ... SentimentText_Transform.419 SentimentText_Transform.420 SentimentText_Transform.421
0 1 ==RUDE== Dude, you are rude upload that carl... 2.0 ... NaN NaN NaN
1 1 == OK! == IM GOING TO VANDALIZE WILD ONES W... 2.0 ... NaN NaN NaN
2 1 Stop trolling, zapatancas, calling me a lia... 2.0 ... NaN NaN NaN
3 1 ==You're cool== You seem like a really cool... 2.0 ... NaN NaN NaN
4 1 ::::: Why are you threatening me? I'm not bei... 2.0 ... NaN NaN NaN
5 1 == hey waz up? == hey ummm... the fif four ... 2.0 ... NaN NaN NaN
6 0 ::::::::::I'm not sure either. I think it has... 2.0 ... NaN NaN NaN
7 0 *::Your POV and propaganda pushing is dully n... 2.0 ... 46.0 32.0 3.0
8 0 == File:Hildebrandt-Greg and Tim.jpg listed ... 2.0 ... NaN NaN NaN
9 0 ::::::::This is a gross exaggeration. Nobody... 2.0 ... NaN NaN NaN
[10 rows x 424 columns]
Sentiment int64
SentimentText object
SentimentText_Transform.000 float32
SentimentText_Transform.001 float32
SentimentText_Transform.002 float32
...
SentimentText_Transform.417 float32
SentimentText_Transform.418 float32
SentimentText_Transform.419 float32
SentimentText_Transform.420 float32
SentimentText_Transform.421 float32
Length: 424, dtype: object
ORT RESULT
Sentiment.output SentimentText.output SentimentText_Transform.output.0 ... SentimentText_Transform.output.419 SentimentText_Transform.output.420 SentimentText_Transform.output.421
0 1 ==RUDE== Dude, you are rude upload that carl... 2 ... 65535 65535 65535
1 1 == OK! == IM GOING TO VANDALIZE WILD ONES W... 2 ... 65535 65535 65535
2 1 Stop trolling, zapatancas, calling me a lia... 2 ... 65535 65535 65535
3 1 ==You're cool== You seem like a really cool... 2 ... 65535 65535 65535
4 1 ::::: Why are you threatening me? I'm not bei... 2 ... 65535 65535 65535
5 1 == hey waz up? == hey ummm... the fif four ... 2 ... 65535 65535 65535
6 0 ::::::::::I'm not sure either. I think it has... 2 ... 65535 65535 65535
7 0 *::Your POV and propaganda pushing is dully n... 2 ... 46 32 3
8 0 == File:Hildebrandt-Greg and Tim.jpg listed ... 2 ... 65535 65535 65535
9 0 ::::::::This is a gross exaggeration. Nobody... 2 ... 65535 65535 65535
[10 rows x 424 columns]
Sentiment.output int64
SentimentText.output object
SentimentText_Transform.output.0 uint16
SentimentText_Transform.output.1 uint16
SentimentText_Transform.output.2 uint16
...
SentimentText_Transform.output.417 uint16
SentimentText_Transform.output.418 uint16
SentimentText_Transform.output.419 uint16
SentimentText_Transform.output.420 uint16
SentimentText_Transform.output.421 uint16
Length: 424, dtype: object
There are three main, independent issues that explain this behavior, and each would require its own solution to make the CharTokenizer onnx export produce the same outputs from ML.NET, OnnxRunner, and ORT.
Offset between ML.NET columns and OnnxRunner/ORT columns
When using the TokenizingByCharactersTransformer in ML.NET (without NimbusML), the output of the transformer is of type Vector<Key<UInt16>> (i.e. a Vector of KeyDataViewTypes that have UInt16 as their RawType).
Because the outputs are Keys, this is affected by the same problem described in #428: the output for ML.NET's Key columns in NimbusML is 0-based, whereas the output of OnnxRunner and ORT is 1-based. The offset appears to be caused by NimbusML automatically subtracting 1 from KeyDataViewType values somewhere in the code. In this issue, the behavior can be seen clearly on row 7 of the output.
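As a quick check of the off-by-one (just a sketch; it assumes the result_expected and result_onnx DataFrames from the repro above are still in scope and that numpy is available):

import numpy as np

# The key values coming back through OnnxRunner/ORT should be exactly one
# larger than the ML.NET values wherever both sides have a value.
transform_cols = [c for c in result_expected.columns
                  if c.startswith('SentimentText_Transform')]
for col in transform_cols:
    from_mlnet = result_expected[col].to_numpy(dtype='float64')
    from_onnx = result_onnx[col].to_numpy(dtype='float64')
    mask = ~np.isnan(from_mlnet) & ~np.isnan(from_onnx)
    assert np.allclose(from_onnx[mask], from_mlnet[mask] + 1)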
Fixing #428 should fix this part of the issue. As discussed offline, solving that issue would likely require major changes in how NimbusML works with Categorical pandas columns and ML.NET's KeyDataViewTypes, so it is a major blocker for fixing this one.
NaN vs. 65535 (float vs. uint16)
The TokenizingByCharactersTransformer maps the char \uffff to 65535; this code point is usually regarded as "not a character" in UTF-16 (the encoding the tokenizer uses). The exported onnx model does the exact same mapping. It seems that wherever there is no character to map for a given SentimentText_Transform column, the model effectively maps a \uffff, which is why we get a 65535 in the ORT output for those columns.
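For reference, \uffff is simply the largest 16-bit code point, so its code unit value is 65535:

print(ord('\uffff'))        # 65535
print(hex(ord('\uffff')))   # 0xffff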
But why do we get NaN in the other outputs? This has to do with PR #267, "Add variable length vector support". What is relevant from that PR here is that NimbusML takes a variable-length uint16 vector column, casts it to float, and fills in NaNs where values are "missing". The exact mechanism of how that PR works isn't clear to me, but by experimenting with the code it introduced (particularly in PythonInterop.h and PythonInterop.cpp), it seems clear that this issue is related to it. It could be that the trailing columns of a variable-length uint vector are simply filled with NaNs without the tokenizer being applied to them, or it could be that the tokenizer is actually applied and the uint16 value 65535 is then mapped to a float NaN.
Since the output of ORT doesn't involve NimbusML, it doesn't perform the described casts, nor does it fill missing values with NaNs; that's why its output is uint16 and contains the 65535 values. The correspondence can be checked with the sketch below.
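Here is a rough check of the NaN/65535 correspondence (again just a sketch, assuming result_onnx and result_ort from the repro above are still in scope; the columns are compared by position because the two outputs use different column names):

onnx_cols = [c for c in result_onnx.columns
             if c.startswith('SentimentText_Transform')]
ort_cols = [c for c in result_ort.columns
            if c.startswith('SentimentText_Transform')]

# Positions that NimbusML reports as NaN should be exactly the positions that
# raw ORT reports as 65535, if the hypothesis above is right.
nan_mask = result_onnx[onnx_cols].isna().to_numpy()
pad_mask = result_ort[ort_cols].to_numpy() == 65535
print((nan_mask == pad_mask).all())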
On the other hand, ML.NET on its own has no problem working with variable-size vector columns, so this part of the issue isn't reproducible using only ML.NET (without NimbusML).
Fixing it might require changing or reverting PR #267 (which could bring its own set of problems, given that its behavior was introduced for a reason), or modifying ML.NET's TokenizingByCharactersTransformer so that it outputs floats instead of uint16s. Either way, further discussion on this topic would be needed.
float64 vs. float32
The output from ML.NET is float64 whereas the output of OnnxRunner is float32. Again, this is related to PR #267.
Without NimbusML, the output of the TokenizingByCharactersTransformer in ML.NET is of type Vector<Key<UInt16>>, while the output of applying the exported onnx model (again without NimbusML) is of type Vector<UInt16>. This difference is, somehow, what causes NimbusML to cast the first case to float64 and the second to float32. The code that determines how variable-length vector types are mapped is the following:
https://github.com/microsoft/NimbusML/blob/1b7c3990df6c87f5ce31e0c05df50b01e8e5001f/src/NativeBridge/PythonInterop.cpp#L47-L64
It says that UInt16 (i.e. unsigned short, or U2) should be mapped to float32 (i.e. the float dtype, whereas double would be float64).
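Independently of NimbusML, the element type the exported model actually declares for its outputs can be checked directly (a sketch only, assuming the onnx Python package is installed and onnx_path is the path used in the repro above):

import onnx

model = onnx.load(onnx_path)
for output in model.graph.output:
    elem_type = output.type.tensor_type.elem_type
    print(output.name, onnx.TensorProto.DataType.Name(elem_type))
# The tokenizer output is expected to show up as UINT16 here, which is what
# ORT returns as-is and what the mapping above turns into float32 on the
# NimbusML side.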
There doesn't seem to be any clear indication to treat Key<UInt16> differently from UInt16. But in the code below, it seems that when a DataView is sent from ML.NET to Python, a KeyDataViewType with RawType U2 gets cast to I4. Assuming this also holds for vectors of Keys, that would explain why ML.NET's output is float64 (i.e. double).
https://github.com/microsoft/NimbusML/blob/1b7c3990df6c87f5ce31e0c05df50b01e8e5001f/src/DotNetBridge/NativeDataInterop.cs#L141-L151
The exact mechanisms behind all of the casts above would need further investigation. But perhaps this type mismatch between float64 and float32 isn't a blocking issue and wouldn't need to be fixed.