sklearn-onnx
Skipping out-of-vocabulary n-grams during batch inference of Tf-IDF ONNXified Vectorizer
During batch inference, an ONNXified TF-IDF vectorizer silently drops out-of-vocabulary texts, whereas the expected behaviour is to return zero vectors for those texts.
Example: suppose a batch contains 5 documents [d1, d2, d3, d4, d5], and d3 and d5 are out-of-vocabulary texts. The expected output is [v1, v2, 0, v4, 0], with zero vectors in place of the out-of-vocabulary documents. But the output the model actually returns is [v1, v2, v4].
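For reference, plain scikit-learn already behaves the expected way: a minimal sketch with an assumed toy corpus, showing that CountVectorizer keeps one row per input document and maps a fully out-of-vocabulary document to an all-zero row instead of dropping it.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["red blue", "green blue"]   # assumed training texts
batch = ["red", "purple", "green"]    # "purple" is out-of-vocabulary

vectorizer = CountVectorizer()
vectorizer.fit(corpus)
vectors = vectorizer.transform(batch).toarray()

# One row per document, none skipped; the OOV document becomes a zero vector.
print(vectors.shape)      # (3, 3)
print(vectors[1].sum())   # 0
```

This is the row alignment the ONNX export is expected to preserve.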
ONNX opset v7, skl2onnx 1.13
Is there a tiny model I can use to replicate your issue and see what I can do to fix it?
@xadupre Here is the code mentioned in this issue to reproduce the bug:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import StringTensorType
import onnxruntime

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
labels = ['a', 'b', 'c', 'd']

# Texts to run through both the sklearn pipeline and the converted model.
to_inference = [
    '.',
    'first',
]

vectorizer = CountVectorizer(
    max_features=5,
    analyzer='word',
    ngram_range=(1, 1),
    encoding='utf8',
    strip_accents=None,
    # Tokens are words, single punctuation characters, or digit runs.
    token_pattern=(
        r"\b[a-zA-Z0-9_]+\b"
        r"|"
        r"[\~\-!.,:;@+&<>*={}\[\]№?()^|/%$#'`\"\\_]"
        r"|"
        r"\d+"
    ),
)
vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())

classifier = LogisticRegression()
model = Pipeline(steps=[("vectorizer", vectorizer), ("classifier", classifier)])
model.fit(corpus, labels)
print(model.predict(to_inference))  # one prediction per input text

onnx_model = convert_sklearn(model, initial_types=[('X', StringTensorType((None,)))])
session = onnxruntime.InferenceSession(
    onnx_model.SerializeToString(), providers=['CPUExecutionProvider']
)
print(session.run(
    (session.get_outputs()[0].name,),
    {session.get_inputs()[0].name: to_inference},
))
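Until the converter is fixed, a caller can at least detect the misalignment rather than silently pairing predictions with the wrong documents. A hedged sketch of such a guard, with simulated outputs matching the report (batch of 5, only 3 rows returned); it does not depend on onnxruntime:

```python
import numpy as np

def check_row_alignment(batch, outputs):
    """Raise if the model returned fewer rows than there were input documents,
    which is the symptom reported above when OOV documents are skipped."""
    outputs = np.asarray(outputs)
    if outputs.shape[0] != len(batch):
        raise ValueError(
            f"expected {len(batch)} rows, got {outputs.shape[0]}: "
            "out-of-vocabulary documents were likely dropped"
        )
    return outputs

# Simulated run: 5 documents in, 3 rows out -> the guard raises.
batch = ["d1", "d2", "d3", "d4", "d5"]
try:
    check_row_alignment(batch, [[0.1], [0.2], [0.4]])
except ValueError as err:
    print(err)
```

This only detects the problem; the zero-vector rows themselves still have to come from the converted graph.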