
Skipping out-of-vocabulary n-grams during batch inference of TF-IDF ONNXified Vectorizer

Open · nssprogrammer opened this issue 1 year ago · 2 comments

During batch inference, an ONNXified TF-IDF vectorizer drops out-of-vocabulary texts entirely, whereas the expected behaviour is to return zero vectors for those texts.

Example: suppose a batch contains 5 documents [d1, d2, d3, d4, d5], where d3 and d5 are out-of-vocabulary texts. The expected output is [v1, v2, 0, v4, 0], i.e. a vector for every document, with zero vectors in the out-of-vocabulary positions. Instead the model returns [v1, v2, v4].

Versions: ONNX v7, skl2onnx 1.13
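
For reference, plain scikit-learn keeps a zero row for out-of-vocabulary documents, which is the behaviour expected here. A minimal sketch (the corpus and names are illustrative, not taken from the issue):

from sklearn.feature_extraction.text import TfidfVectorizer

vec = TfidfVectorizer().fit(["first document", "second document"])
batch = ["first document", "zzz", "second document"]  # "zzz" has no in-vocabulary tokens
out = vec.transform(batch)
print(out.shape)         # (3, 3): one row per input document, including the OOV one
print(out.toarray()[1])  # the out-of-vocabulary row is all zeros, not dropped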

— nssprogrammer, Jun 07 '23

Is there a tiny model I can use to replicate your issue and see what I can do to fix it?

— xadupre, Jun 22 '23

@xadupre the code below, mentioned in this issue, reproduces the bug:

from sklearn.feature_extraction.text import CountVectorizer
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import StringTensorType
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
import onnxruntime

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

labels = [
    'a',
    'b',
    'c',
    'd'
]

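# inference batch; with max_features=5, at least one of these inputs likely
# contains no in-vocabulary tokens and so should map to an all-zero row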
to_inference = [
    '.',
    'first'
]

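# custom token_pattern: matches words, single punctuation characters, or digit runs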
vectorizer = CountVectorizer(
    max_features=5,
    analyzer='word',
    ngram_range=(1, 1),
    encoding='utf8',
    strip_accents=None,
    token_pattern=(
            r"\b[a-zA-Z0-9_]+\b"
            r"|"
            r"[\~\-!.,:;@+&<>*={}\[\]№?()^|/%$#'`\"\\_]"
            r"|"
            r"\d+"
        )
)
vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())


classifier = LogisticRegression()


model = Pipeline(steps=[("vectorizer", vectorizer), ("classifier", classifier)])
model.fit(corpus, labels)
print(model.predict(to_inference))

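# convert the fitted pipeline to ONNX, taking a 1-D string tensor of dynamic length as input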
onnx_model = convert_sklearn(model, initial_types=[('X', StringTensorType([None]))])


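# run the converted model on the same batch with onnxruntime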
session = onnxruntime.InferenceSession(onnx_model.SerializeToString(), providers=['CPUExecutionProvider'])
print(session.run([session.get_outputs()[0].name], {session.get_inputs()[0].name: to_inference}))
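
To make the dropped rows visible, one can compare the number of sklearn predictions with the number of rows onnxruntime returns; this sketch reuses the objects defined above, and the comparison reflects how the bug is reported to manifest:

skl_pred = model.predict(to_inference)
onnx_out = session.run([session.get_outputs()[0].name],
                       {session.get_inputs()[0].name: to_inference})[0]
# If the bug reproduces as reported, len(onnx_out) < len(skl_pred):
# the out-of-vocabulary input lost its row instead of getting a zero vector.
print(len(skl_pred), len(onnx_out))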

— vsbaldeev, Oct 05 '23