sklearn-onnx
Custom tokenizer cannot be converted into ONNX
Hello,
I am trying to convert a rather simple sklearn model to ONNX. I would like to save a pipeline containing a CountVectorizer with a custom tokenizer followed by a simple classifier. Apparently, a custom tokenizer is not allowed. Is there a fix or a workaround?
I'd greatly appreciate any help!
ONNX does not have an official operator to tokenize strings. One custom operator is implemented by onnxruntime and uses re2 to split a string into words. This option is supported by sklearn-onnx. There might be discrepancies, because scikit-learn uses the re package to tokenize (re2 and re are not exactly the same).
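As a possible workaround, a custom tokenizer that only splits on a character class can sometimes be rewritten as a regular expression, which is the form the regex-based tokenizer works with. The sketch below uses only the standard re module; `custom_tokenizer` and `EQUIVALENT_PATTERN` are hypothetical names invented for illustration, and whether re2 accepts a given pattern and produces the same tokens still has to be checked on your own data.

```python
import re

# Default token_pattern of scikit-learn's CountVectorizer:
# tokens of two or more word characters.
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"


def custom_tokenizer(text):
    """Hypothetical custom tokenizer: lowercase, split on anything
    that is not an ASCII letter or digit, keep one-character tokens."""
    return [tok for tok in re.split(r"[^a-z0-9]+", text.lower()) if tok]


# Assumed regex equivalent of custom_tokenizer (illustrative only).
EQUIVALENT_PATTERN = r"[a-z0-9]+"


def regex_tokenizer(text, pattern=EQUIVALENT_PATTERN):
    """Tokenize by pattern matching, the way token_pattern is applied."""
    return re.findall(pattern, text.lower())


samples = ["Hello, world! a-b_c 42", "Custom tokenizer -> ONNX?", ""]
for sample in samples:
    # Both approaches should agree before the callable is dropped.
    assert custom_tokenizer(sample) == regex_tokenizer(sample)
```

If the two agree on representative inputs, the pattern could be passed as `CountVectorizer(token_pattern=EQUIVALENT_PATTERN)` in place of the `tokenizer` callable. This assumes the default `lowercase=True` is kept, and the converted model should still be compared against the scikit-learn output because of the re and re2 differences mentioned above.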
Other tokenizers are implemented in onnxruntime-extensions (see documentation). For the time being, it would be difficult to insert one of these tokenizers into a scikit-learn pipeline and then convert it into ONNX (this is not implemented unless you write your own converter).
Let us know if this kind of feature would be interesting for you.