[Tokenizers] Port LLaMA Tokenizer and SentencePiece algorithm

Open ericstj opened this issue 1 year ago • 0 comments

The SentencePiece algorithm should be added to Microsoft.ML.Tokenizers. It is a dependency of LLaMATokenizer, which we also wish to enable.

We can see reference implementations in:

- https://github.com/microsoft/BlingFire (MIT license)
- https://github.com/google/sentencepiece (Apache license)
- https://github.com/huggingface/tokenizers (Apache license)
- https://huggingface.co/docs/transformers/main/en/model_doc/llama

Hugging Face also has Llama 2. It might be worth understanding whether that should be included as well, or at least designed for later inclusion.

LLaMA Tokenizer: https://arxiv.org/abs/2203.13474 https://arxiv.org/pdf/2203.13474.pdf

Sentence Piece: https://arxiv.org/abs/1808.06226 https://arxiv.org/pdf/1808.06226.pdf

ericstj, Feb 05 '24 16:02