machinelearning
[Tokenizers] Port LLaMA Tokenizer and SentencePiece algorithm
The SentencePiece algorithm should be added to Microsoft.ML.Tokenizers. It is a dependency of LLaMATokenizer, which we also wish to enable.
Reference implementations are available in:
https://github.com/microsoft/BlingFire (MIT license)
https://github.com/google/sentencepiece (Apache license)
https://github.com/huggingface/tokenizers (Apache license)
https://huggingface.co/docs/transformers/main/en/model_doc/llama
Hugging Face also has Llama 2. It might be worth understanding whether that should be included as well, or at least designed for so it can be added later.
LLaMA Tokenizer: https://arxiv.org/abs/2203.13474 https://arxiv.org/pdf/2203.13474.pdf
SentencePiece: https://arxiv.org/abs/1808.06226 https://arxiv.org/pdf/1808.06226.pdf