
Please provide examples of the vocab and merge file formats

Open Ben-Pattinson opened this issue 2 years ago • 2 comments

Please provide examples of the vocab and merge file formats. Better yet, provide links to downloadable, pre-created files for common models such as GPT-3.



Ben-Pattinson avatar Mar 01 '23 09:03 Ben-Pattinson

@tarekgh is this something we will be able to do with your new changes?

michaelgsharp avatar Jan 23 '24 23:01 michaelgsharp

@Ben-Pattinson thanks for pointing that out. You can look at the merges.txt and vocab.json files to see their format, and download them too if you want. These are the files used for GPT-2. Would you be interested in submitting a doc PR to include this info?
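For a rough idea of what the two formats look like, here is a minimal Python sketch. It assumes the GPT-2 style layout: vocab.json is a single JSON object mapping each token string to an integer id, and merges.txt is a `#version` header line followed by one merge rule per line (two symbols separated by a space), ordered from highest to lowest priority. The tiny `VOCAB_JSON` and `MERGES_TXT` samples and the helper names `load_vocab`, `load_merges`, and `bpe` are made up for illustration; they are not the real GPT-2 files or any library's API. The real files also encode raw bytes as printable characters (a leading space shows up as `Ġ`), which this sketch skips.

```python
import json

# Hypothetical miniature vocab.json content: one JSON object, token -> id.
VOCAB_JSON = '{"h": 0, "e": 1, "l": 2, "o": 3, "he": 4, "ll": 5, "hell": 6, "hello": 7}'

# Hypothetical miniature merges.txt content: a "#version" header line,
# then one merge rule per line, earliest line = highest priority.
MERGES_TXT = """#version: 0.2
h e
l l
he ll
hell o
"""


def load_vocab(text):
    """Parse vocab.json text into a dict of token string -> integer id."""
    return json.loads(text)


def load_merges(text):
    """Parse merges.txt text into an ordered list of (left, right) pairs."""
    merges = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#version"):
            continue  # skip blank lines and the header
        left, right = line.split(" ")
        merges.append((left, right))
    return merges


def bpe(word, merges):
    """Apply merge rules to one word, always merging the best-ranked pair first."""
    ranks = {pair: i for i, pair in enumerate(merges)}
    symbols = list(word)  # simplification: start from characters, not bytes
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no applicable merge rule is left
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


vocab = load_vocab(VOCAB_JSON)
merges = load_merges(MERGES_TXT)
tokens = bpe("hello", merges)
ids = [vocab[t] for t in tokens]
```

With the sample data above, the rules collapse "hello" step by step into a single token, which is then looked up in the vocab to get its id. A real tokenizer would also pre-split text into words and handle tokens missing from the vocab.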

I am currently working on supporting the Tiktoken tokenizer, which is used with GPT-4 and gpt-3.5-turbo.

tarekgh avatar Jan 24 '24 00:01 tarekgh