Implement Sentencepiece Unigram tokenizer

Open arthurvb opened this issue 1 year ago • 2 comments

Is your feature request related to a problem? Please describe.
I want to use a multilingual model from Hugging Face (https://huggingface.co/intfloat/multilingual-e5-large). Its tokenizer is a SentencePiece Unigram tokenizer, so I am unable to port the model to C#/ONNX.

Describe the solution you'd like
Support for the SentencePiece Unigram tokenizer in the Microsoft.ML.Tokenizers package.

Describe alternatives you've considered
BlingFire, but it no longer seems to be maintained, and it is unclear whether it would return exactly the same token IDs.

Thank you for your time and effort (the library in general is great!)

arthurvb avatar Jul 03 '24 13:07 arthurvb

@tarekgh do any of our existing tokenizers support this, or is this new work?

ericstj avatar Aug 05 '24 18:08 ericstj

do any of our existing tokenizers support this, or is this new work?

This is a new model that needs to be implemented.

tarekgh avatar Aug 05 '24 22:08 tarekgh
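
For readers unfamiliar with the request: a Unigram tokenizer assigns each vocabulary piece a log-probability and picks the segmentation of the input with the highest total score, typically with a Viterbi search. The Python sketch below illustrates only that core step. It is not the Microsoft.ML.Tokenizers API or the SentencePiece implementation; `unigram_tokenize`, `toy_vocab`, and `unk_logprob` are hypothetical names, the scores are made up, and real models also apply normalization and other handling that is omitted here.

```python
import math

def unigram_tokenize(text, vocab, unk_logprob=-20.0):
    """Return the split of `text` with the highest total log-probability (Viterbi)."""
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best score for text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last piece ending at i
    best[0] = 0.0
    max_len = max(map(len, vocab), default=1)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in vocab:
                score = vocab[piece]
            elif end - start == 1:
                score = unk_logprob  # assumption: unknown characters become single pieces
            else:
                continue
            if best[start] + score > best[end]:
                best[end] = best[start] + score
                back[end] = start
    # Walk the back-pointers from the end to recover the pieces.
    pieces = []
    i = n
    while i > 0:
        pieces.append(text[back[i]:i])
        i = back[i]
    return pieces[::-1]

# Toy vocabulary with made-up log-probabilities; "\u2581" is the SentencePiece space marker.
toy_vocab = {
    "\u2581": -3.0,
    "\u2581hello": -2.0,
    "\u2581world": -2.5,
    "hel": -4.0,
    "lo": -4.0,
}

print(unigram_tokenize("\u2581hello\u2581world", toy_vocab))
```

With this toy vocabulary the search prefers the two whole-word pieces over any combination of shorter pieces, because their summed log-probability is higher. Producing token IDs identical to the Hugging Face tokenizer would additionally require the model's actual vocabulary, scores, and normalization rules.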