How to generate vocab.json and merges.txt for YTTM tokenizer?

Open: nikhilno1 opened this issue 4 years ago • 1 comment

I want to train a GPT2 model with a new vocabulary. I am following the instructions given here: https://github.com/mgrankin/ru_transformers. The YTTM tokenizer outputs a yt.model file that contains the new vocab. However, the run_generation.py script requires vocab.json and merges.txt files. I can see the vocab with the command below:

yttm vocab --model yt.model

But I don't know how to convert it into vocab.json and merges.txt format. Shouldn't this have been a common problem?
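
For the vocab.json half only, here is a minimal sketch that assumes a plain token-to-id JSON mapping is what is wanted. It relies on the YouTokenToMe Python API (`yttm.BPE(model=...)` and `bpe.vocab()`, which returns the subwords in id order). The helper names `tokens_to_vocab_json` and `load_yttm_tokens` are hypothetical, made up for this illustration. It does not reconstruct merges.txt, and the result is not guaranteed to work with GPT-2's byte-level tokenizer, since YTTM merges Unicode characters rather than bytes.

```python
import json
from pathlib import Path


def tokens_to_vocab_json(tokens, out_path):
    """Write a {token: id} mapping, where id is the token's position in the list.

    Hypothetical helper: it only dumps the mapping and makes no claim that the
    file matches what any particular GPT-2 script expects.
    """
    vocab = {token: idx for idx, token in enumerate(tokens)}
    if len(vocab) != len(tokens):
        # Duplicate tokens would silently drop ids, so fail loudly instead.
        raise ValueError("token list contains duplicates")
    Path(out_path).write_text(
        json.dumps(vocab, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return vocab


def load_yttm_tokens(model_path):
    """Read the subword list from a trained YTTM model (needs youtokentome)."""
    import youtokentome as yttm  # imported lazily so the helper above works without it

    bpe = yttm.BPE(model=model_path)
    return bpe.vocab()  # i-th string corresponds to the subword with id i
```

With youtokentome installed, `tokens_to_vocab_json(load_yttm_tokens("yt.model"), "vocab.json")` would write the mapping next to the model. The merge rules still have to come from somewhere else.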

nikhilno1, Mar 08 '20 16:03

This is also an issue for me.

ckoshka, Jul 26 '21 23:07