sklearn-onnx

Custom tokenizer cannot be converted into ONNX

Open Vikramardham opened this issue 3 years ago • 1 comments

Hello,

I am trying to convert a rather simple sklearn model to ONNX. I would like to save a pipeline containing a CountVectorizer with a custom tokenizer followed by a simple classifier. Apparently, a custom tokenizer is not allowed. Is there a fix or a workaround?

I'd greatly appreciate any help!

Vikramardham avatar Jun 21 '21 12:06 Vikramardham

ONNX does not have an official operator to tokenize strings. One custom operator is implemented by onnxruntime and uses re2 to split a string into words. This option is supported by sklearn-onnx. There might be discrepancies, as sklearn uses the re package to tokenize (re2 and re are not exactly the same).
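Since the supported path is regex-based splitting, one possible workaround is to express the custom tokenizer as a regular expression rather than a Python callable. The sketch below uses only the standard `re` module to show the idea. `CUSTOM_TOKEN_PATTERN` and `tokenize_with_pattern` are hypothetical names chosen for illustration, and whether a given pattern converts and yields identical tokens under re2 would still need to be checked against the converted model.

```python
import re

# Default token pattern of scikit-learn's CountVectorizer:
# words of two or more word characters.
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"

# Hypothetical custom pattern (an assumption for this example): keeps
# single-character tokens and treats hyphenated words as one token.
# It avoids lookarounds and backreferences, which re2 does not support.
CUSTOM_TOKEN_PATTERN = r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*"


def tokenize_with_pattern(text, pattern):
    """Return every non-overlapping match of `pattern` in `text`."""
    return re.compile(pattern).findall(text)


sample = "I bought a state-of-the-art GPU"
default_tokens = tokenize_with_pattern(sample, DEFAULT_TOKEN_PATTERN)
custom_tokens = tokenize_with_pattern(sample, CUSTOM_TOKEN_PATTERN)
print(default_tokens)
print(custom_tokens)
```

If the tokenizer can be written this way, the pattern can be passed as `CountVectorizer(token_pattern=...)` instead of `tokenizer=...`, which keeps the pipeline on the regex-based path described above. Comparing the outputs of the sklearn pipeline and the converted model on a few real inputs would catch any differences between re and re2.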

Other tokenizers are implemented in onnxruntime-extensions (see documentation). For the time being, it would be difficult to insert one of these tokenizers in a scikit-learn pipeline and then convert it into ONNX (it is not implemented unless you write your own converter).

Let us know if this kind of feature would be of interest to you.

xadupre avatar Jul 01 '21 11:07 xadupre