machinelearning
Please provide examples of the vocab and merge file formats
Please provide examples of the vocab and merge file formats. Better yet, provide links to downloadable, pre-built files for common models such as GPT-3.
Document Details
⚠ Do not edit this section. It is required for learn.microsoft.com ➟ GitHub issue linking.
- ID: 909b9e8c-f08c-ba0a-6c31-7dd151699999
- Version Independent ID: 909b9e8c-f08c-ba0a-6c31-7dd151699999
- Content: Bpe Constructor (Microsoft.ML.Tokenizers)
- Content Source: dotnet/xml/Microsoft.ML.Tokenizers/Bpe.xml
- Product: dotnet-ml-api
- GitHub Login: @natke
- Microsoft Alias: nakersha
@tarekgh is this something we will be able to do with your new changes?
@Ben-Pattinson thanks for pointing that out. You can look at the merges.txt and vocab.json files used for GPT-2 to see the format, and you can download them too if you want. Would you be interested in submitting a doc PR to include this info?
I am currently working on support for the Tiktoken tokenizer, which is used with GPT-4 and gpt-3.5-turbo.
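For anyone who just wants to see the shape of the two files, here is a minimal sketch in Python (not the Microsoft.ML.Tokenizers API). It assumes the GPT-2 style layout: vocab.json is a single JSON object mapping each token string to an integer id, and merges.txt is an optional `#version` header followed by one space-separated pair per line, with earlier lines taking priority. The toy contents and the helper names `parse_vocab`, `parse_merges`, and `encode_word` are made up for illustration; the real GPT-2 files hold roughly 50,000 entries each and use byte-level symbols such as `Ġ` for a leading space.

```python
import json

# Hypothetical toy file contents, only to illustrate the layout.
VOCAB_JSON = '{"l": 0, "o": 1, "w": 2, "lo": 3, "low": 4}'
MERGES_TXT = "#version: 0.2\nl o\nlo w\n"


def parse_vocab(text):
    """vocab.json: one JSON object mapping token string -> integer id."""
    return json.loads(text)


def parse_merges(text):
    """merges.txt: skip the '#version' header, then read one 'left right' pair per line."""
    pairs = []
    for line in text.splitlines():
        if not line or line.startswith("#version"):
            continue
        left, right = line.split(" ")
        pairs.append((left, right))
    return pairs


def encode_word(word, vocab, merges):
    """Simplified BPE: repeatedly merge the adjacent pair that appears earliest in merges.txt."""
    ranks = {pair: i for i, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        candidates = [
            (ranks[pair], i)
            for i, pair in enumerate(zip(symbols, symbols[1:]))
            if pair in ranks
        ]
        if not candidates:
            break
        _, i = min(candidates)  # best-ranked pair, leftmost on ties
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return [vocab[s] for s in symbols]


vocab = parse_vocab(VOCAB_JSON)
merges = parse_merges(MERGES_TXT)
print(encode_word("low", vocab, merges))
```

The merge order matters: `l o` must come before `lo w`, because the second merge can only apply once the first has produced `lo`. Every symbol that can result from a merge also needs an entry in vocab.json, otherwise the final id lookup fails.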