Add support for .tiktoken file format to Microsoft.ML.Tokenizers

Open luisquintanilla opened this issue 2 years ago • 2 comments

OpenAI's tiktoken library provides tokenization support for OpenAI models. Older GPT models were compatible with the common BPE tokenizer vocab format (vocab.json / merges.txt). More recent models use other formats.

Update Microsoft.ML.Tokenizers to provide support for the cl100k_base vocabulary format.

luisquintanilla, Apr 26 '23 17:04

@tarekgh is this something that's possible today with the current BPE tokenizer or is it limited to the vocab.json / merges.txt conventions?

luisquintanilla, Apr 26 '23 17:04

From what I am seeing, cl100k_base is the vocab file, which just needs to be parsed. It is a simple format, so parsing it should be easy. What is not clear to me, and it would be good if someone could look into it, is that the tiktoken tokenizer does not use merges.txt or anything similar. How does tiktoken differ from Bpe in this part?

https://github.com/openai/tiktoken/issues/78 https://github.com/aiqinxuancai/TiktokenSharp/tree/3f68ecdb71d3f855fbfad4aa45dae470574c2378

tarekgh, Apr 26 '23 19:04
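
For anyone following the question above, here is a minimal Python sketch of how the two pieces might fit together. It assumes each line of a .tiktoken file is a base64-encoded token followed by a space and an integer rank, and that the rank doubles as the merge priority (the adjacent pair whose concatenation has the lowest rank is merged first), which would explain why no merges.txt is needed. The names `parse_tiktoken_vocab`, `bpe_encode`, and the toy vocabulary are hypothetical and for illustration only; this is not the Microsoft.ML.Tokenizers API, and it skips the regex pre-splitting and special tokens that a real tokenizer applies.

```python
from __future__ import annotations

import base64


def parse_tiktoken_vocab(text: str) -> dict[bytes, int]:
    """Parse .tiktoken content, assumed to be '<base64 token> <rank>' per line."""
    ranks: dict[bytes, int] = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        token_b64, rank = line.split()
        ranks[base64.b64decode(token_b64)] = int(rank)
    return ranks


def bpe_encode(piece: bytes, ranks: dict[bytes, int]) -> list[int]:
    """Rank-driven BPE: repeatedly merge the adjacent pair with the lowest rank.

    Assumes every single byte in `piece` has its own entry in `ranks`.
    """
    parts = [bytes([b]) for b in piece]
    while len(parts) > 1:
        best_index, best_rank = None, None
        for i in range(len(parts) - 1):
            rank = ranks.get(parts[i] + parts[i + 1])
            if rank is not None and (best_rank is None or rank < best_rank):
                best_index, best_rank = i, rank
        if best_index is None:
            break  # no adjacent pair is in the vocabulary
        merged = parts[best_index] + parts[best_index + 1]
        parts[best_index:best_index + 2] = [merged]
    return [ranks[p] for p in parts]


# Toy vocabulary (hypothetical): "a", "b", "c", "ab", "abc" with ranks 0..4.
TOY_VOCAB = "YQ== 0\nYg== 1\nYw== 2\nYWI= 3\nYWJj 4\n"
toy_ranks = parse_tiktoken_vocab(TOY_VOCAB)
print(bpe_encode(b"abc", toy_ranks))  # [4]
print(bpe_encode(b"cab", toy_ranks))  # [2, 3]
```

If that reading is right, the difference from the vocab.json / merges.txt convention is mostly one of representation: the merge order is implied by the token ranks instead of being listed in a separate file.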