YouTokenToMe
How to generate vocab.json and merges.txt for YTTM tokenizer?
I want to train a GPT-2 model with a new vocabulary. I am following the instructions given here: https://github.com/mgrankin/ru_transformers. The YTTM tokenizer outputs a yt.model file that contains the new vocab. However, the run_generation.py script requires vocab.json and merges.txt files. I can see the vocab with the command below:
yttm vocab --model yt.model
But I don't know how to convert it into vocab.json and merges.txt format. Shouldn't this have been a common problem?
This is also an issue for me.
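For the vocab.json half of the question, here is a rough sketch, not a tested recipe for run_generation.py. It assumes that vocab.json is a plain JSON object mapping each token string to its integer id, and that you already have the subwords as a Python list ordered by id (the YouTokenToMe README describes `BPE.vocab()` as returning such a list). The function name `write_vocab_json` and the sample tokens are made up for illustration. It does not produce merges.txt: that file lists the ranked merge pairs, and as far as I can tell those cannot be reliably recovered from the vocab list alone. Also note that YTTM marks word boundaries with "▁", while the stock GPT-2 tokenizer is byte-level, so the resulting file may still not be a drop-in replacement.

```python
import json


def write_vocab_json(tokens, path):
    """Write a token -> id mapping, using each token's list position as its id.

    `tokens` is assumed to be ordered by subword id, e.g. the list returned by
    youtokentome.BPE(model="yt.model").vocab() (not imported here).
    """
    vocab = {}
    for index, token in enumerate(tokens):
        # A JSON object cannot hold the same key twice, so fail loudly
        # instead of silently dropping an id.
        if token in vocab:
            raise ValueError(f"duplicate token {token!r} at index {index}")
        vocab[token] = index
    with open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps characters such as "▁" readable in the file.
        json.dump(vocab, f, ensure_ascii=False)
    return vocab


if __name__ == "__main__":
    # Made-up sample tokens; replace with the real list from your model.
    sample = ["<PAD>", "<UNK>", "<BOS>", "<EOS>", "▁the", "ing"]
    write_vocab_json(sample, "vocab.json")
```

With the real model, you would pass the list from `BPE.vocab()` instead of the sample list. Whether the downstream script accepts the result without a matching merges.txt is something I have not verified.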