Support loading tokenizer from `sentencepiece` model
Currently, if a SentencePiece `.model` file is provided, the user must run a provided script to convert it into the equivalent `tokenizer.json`. By supporting SentencePiece models directly, we can remove this requirement.
Note: This requires creating a `TokenizerLike` trait and abstracting the tokenizer source file loading process, which is already complicated given the need to support loading from GGUF tokenizers.
Potential crate: https://docs.rs/sentencepiece/latest/sentencepiece/
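To make the abstraction concrete, here is a minimal sketch of what a `TokenizerLike` trait could look like. The trait name comes from this issue, but the method names and the toy whitespace-based implementation are purely illustrative assumptions, not mistral.rs's actual API; a real SentencePiece-backed implementation would wrap the crate linked above instead.

```rust
// Hypothetical sketch only: trait methods are assumptions, not mistral.rs's API.
trait TokenizerLike {
    fn encode(&self, text: &str) -> Vec<u32>;
    fn decode(&self, ids: &[u32]) -> String;
    fn vocab_size(&self) -> usize;
}

// Toy whitespace tokenizer standing in for a real backend
// (tokenizer.json, GGUF, or SentencePiece), so the trait can be
// exercised without any model file.
struct WhitespaceTokenizer {
    vocab: Vec<String>,
}

impl TokenizerLike for WhitespaceTokenizer {
    fn encode(&self, text: &str) -> Vec<u32> {
        text.split_whitespace()
            .map(|w| {
                self.vocab
                    .iter()
                    .position(|v| v == w)
                    .map(|i| i as u32)
                    .unwrap_or(0) // index 0 as a stand-in unk id
            })
            .collect()
    }

    fn decode(&self, ids: &[u32]) -> String {
        ids.iter()
            .map(|&i| self.vocab[i as usize].as_str())
            .collect::<Vec<_>>()
            .join(" ")
    }

    fn vocab_size(&self) -> usize {
        self.vocab.len()
    }
}

fn main() {
    let tok = WhitespaceTokenizer {
        vocab: vec!["<unk>".into(), "hello".into(), "world".into()],
    };
    let ids = tok.encode("hello world");
    assert_eq!(ids, vec![1, 2]);
    assert_eq!(tok.decode(&ids), "hello world");
}
```

The loading path would then pick a concrete implementation based on which source file is present, and the rest of the pipeline would only see the trait object.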
> the user must run a provided script to convert into the equivalent `tokenizer.json`
Something that apparently isn't very straightforward, going by upstream advice? 🤔
An alternative option is deferring to a converter, if `tokenizer.model` can be reliably converted to `tokenizer.json`?
I find the Python tools a bit difficult to follow when I've looked at the source of some of them.
> which is already complicated given the need to support loading from GGUF tokenizers.
Is it not able to leverage that? I thought I had read that Unigram was effectively SentencePiece? If the configuration can be mapped to the `tokenizers` logic we already have, then all we need is the data from `tokenizer.model` to fit into `PropsGGUF`?
My earlier questions about the Unigram decoder/normalizer config seem to be covered by the SentencePiece protobuf descriptions:
https://github.com/google/sentencepiece/blob/6225e08edb2577757163b3f5dbba4c0b670ef445/src/sentencepiece_model.proto#L254-L263
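For illustration, here is a small sketch of the whitespace handling those normalizer options imply: SentencePiece's `add_dummy_prefix` prepends a space before encoding, and its whitespace escaping maps spaces to U+2581 ("▁"); decoding reverses both. This mirrors what the `tokenizers` Metaspace pre-tokenizer/decoder does for Unigram, which is roughly the mapping being discussed. The function names and flag plumbing here are assumptions for the sketch, not either library's API.

```rust
// "▁" (U+2581), the SentencePiece whitespace marker.
const WS: char = '\u{2581}';

// Sketch of SentencePiece-style normalization: optionally prepend a
// dummy-prefix space, then escape spaces as "▁".
fn normalize(text: &str, add_dummy_prefix: bool, escape_whitespaces: bool) -> String {
    let mut s = String::new();
    if add_dummy_prefix && !text.starts_with(' ') {
        s.push(' ');
    }
    s.push_str(text);
    if escape_whitespaces {
        s.replace(' ', &WS.to_string())
    } else {
        s
    }
}

// Reverse the escaping and strip the dummy-prefix space, if any.
fn denormalize(pieces: &str) -> String {
    let s = pieces.replace(WS, " ");
    s.strip_prefix(' ').unwrap_or(&s).to_string()
}

fn main() {
    let norm = normalize("hello world", true, true);
    assert_eq!(norm, "\u{2581}hello\u{2581}world");
    assert_eq!(denormalize(&norm), "hello world");
}
```

If this mapping holds in general, the existing Unigram path could be reused and only the `tokenizer.model` parsing would be new.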
At a glance I can see why it may be preferable to adopt the crate instead of trying to parse it into the `tokenizers` format.
Is there much demand for this feature? I wouldn't bother with it while mistral.rs is still in its early stages, unless user interest is sufficient to justify it? You've already got plenty to focus on as-is right now 😅
Those users may also find it acceptable to convert the `tokenizer.model`. The friction may be that the process doesn't appear to be well documented; at least, I struggled when I tried to search for it. Users interested in this support may want it for similarly popular models, which could potentially be served by docs guiding how to convert those (if the steps differ).
If that becomes a burden to maintain, then integrating support for `tokenizer.model` may be worthwhile?
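As a rough illustration of what such conversion docs would ultimately produce, here is a toy sketch that turns SentencePiece (piece, score) pairs into the Unigram `model` section of a `tokenizers`-style `tokenizer.json`. This is an assumption-laden simplification: a real converter also needs the normalizer, pre-tokenizer, and decoder config, special tokens, and proper JSON string escaping, none of which are handled here.

```rust
// Toy sketch: build the Unigram "model" section of a tokenizer.json
// from (piece, score) pairs. No JSON escaping or normalizer/decoder
// config; illustration only, not a real converter.
fn unigram_model_json(pieces: &[(&str, f64)], unk_id: usize) -> String {
    let vocab: Vec<String> = pieces
        .iter()
        .map(|(p, s)| format!("[\"{}\",{}]", p, s))
        .collect();
    format!(
        "{{\"model\":{{\"type\":\"Unigram\",\"unk_id\":{},\"vocab\":[{}]}}}}",
        unk_id,
        vocab.join(",")
    )
}

fn main() {
    let json = unigram_model_json(&[("<unk>", 0.0), ("hello", -3.2)], 0);
    println!("{}", json);
}
```

Even as a doc-level sketch, this shows why the conversion is mostly mechanical once the proto has been parsed, which is part of the argument for doing it natively.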
FWIW, here is exllamav2's approach to SentencePiece support, via a wrapper of their own that seems to match the HF tokenizers API? (Both extend from this common tokenizer class.)
I think that this could be an interesting feature, but as you said, I also haven't received any issues about this. It is probably not critical, and we have other things to work on.
> I find the Python tools a bit difficult to follow when I've looked at the source of some of them.
Would you be able to tell me which ones? I would like to improve their quality.
> Would you be able to tell me which ones?
It was early on in my learning about LLMs, so it may have just been unfamiliarity and my lack of Python skills.
I think it was the llama.cpp Python script for converting/quantizing to GGUF, and the same for EXL2.
There was also the one referenced in the candle repo for converting `tokenizer.model` to `tokenizer.json`. One of the issues involved you and a maintainer there, IIRC, stating that a lib would be a bit tricky, but that a Rust CLI tool was doable in their opinion? They also said that sort of conversion wasn't easy to give instructions for. I don't recall the linked source being that well documented either, so I assume you had to read through the source.
I had not tried any of them locally; perhaps they gave decent CLI help output 🤷♂️ I was partially interested in the logic itself, for writing a converter tool in Rust.
Ok, great. I think the script right now is fine, but we should consider support for loading directly from sentencepiece as a longer-term goal.