machinelearning
Add support for .tiktoken file format to Microsoft.ML.Tokenizers
OpenAI's tiktoken library provides tokenization support for OpenAI models. Older GPT models were compatible with the common BPE tokenizer vocab format (vocab.json / merges.txt). More recent models use other formats.
Update Microsoft.ML.Tokenizers to support the cl100k_base vocabulary format.
@tarekgh is this something that's possible today with the current BPE tokenizer or is it limited to the vocab.json / merges.txt conventions?
From what I am seeing, cl100k_base is just a vocab file that needs to be parsed. It is a simple format, so parsing it should be easy. What is not clear to me, and it would be good if someone could look into it, is that the tiktoken tokenizer does not use merges.txt or anything similar. How does tiktoken differ from Bpe in this respect?
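For illustration, here is a minimal Python sketch of how a .tiktoken file can be read and how merging can work without merges.txt. It relies on two things from the tiktoken side: each line of the file is a base64-encoded token followed by its integer rank, and the rank doubles as the merge priority (the adjacent pair whose concatenation has the lowest rank is merged first). The names `parse_tiktoken`, `bpe_encode`, and the toy vocabulary are made up for this example; the regex pre-splitting and special-token handling that tiktoken also performs are left out, so this is not a drop-in implementation.

```python
import base64


def parse_tiktoken(text: str) -> dict[bytes, int]:
    """Parse .tiktoken content: one '<base64 token> <rank>' pair per line."""
    ranks: dict[bytes, int] = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        token_b64, rank = line.split()
        ranks[base64.b64decode(token_b64)] = int(rank)
    return ranks


def bpe_encode(piece: bytes, ranks: dict[bytes, int]) -> list[int]:
    """Rank-based BPE: repeatedly merge the adjacent pair with the lowest rank.

    Assumes every single byte in `piece` has an entry in `ranks`
    (true for real vocabularies, but only partly for the toy one below).
    """
    parts = [bytes([b]) for b in piece]
    while len(parts) > 1:
        best_idx, best_rank = None, None
        for i in range(len(parts) - 1):
            rank = ranks.get(parts[i] + parts[i + 1])
            if rank is not None and (best_rank is None or rank < best_rank):
                best_idx, best_rank = i, rank
        if best_idx is None:
            break  # no adjacent pair forms a known token
        parts[best_idx:best_idx + 2] = [parts[best_idx] + parts[best_idx + 1]]
    return [ranks[p] for p in parts]


# Hypothetical toy vocabulary: a=0, b=1, c=2, ab=3, abc=4
TOY_VOCAB = "YQ== 0\nYg== 1\nYw== 2\nYWI= 3\nYWJj 4\n"
toy_ranks = parse_tiktoken(TOY_VOCAB)
abc_ids = bpe_encode(b"abc", toy_ranks)  # [4]
cab_ids = bpe_encode(b"cab", toy_ranks)  # [2, 3]
```

In this scheme the vocab file alone carries the information that merges.txt carries in the classic format, which is why a separate merges file is not needed.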
https://github.com/openai/tiktoken/issues/78 https://github.com/aiqinxuancai/TiktokenSharp/tree/3f68ecdb71d3f855fbfad4aa45dae470574c2378