
Converting OPT-175B tokenizer to HF format?

Open mawilson1234 opened this issue 1 year ago • 2 comments

❓ Questions and Help

What is your question?

I've downloaded the weights for OPT-175B using the URL I got after filling out the Google form. I've also got dict.txt, gpt2-merges.txt, and gpt2-vocab.json. My existing workflow uses the Hugging Face API, so I've converted the weights to HF format using the script here.

However, I'm not sure how to convert the tokenizer to HF format from these files. I see there is a way to build a tokenizer from gpt2-merges.txt and gpt2-vocab.json, but that leaves dict.txt unused, which strikes me as likely to cause issues (I can't imagine it would exist if it weren't needed). Is there a way to incorporate dict.txt when building the HF tokenizer?
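For concreteness, here's a sanity check I could run locally. It assumes dict.txt is the usual fairseq format of one `symbol count` pair per line, and that in GPT-2 BPE setups those symbols are numeric BPE token ids; both of those are assumptions on my part, not something I've confirmed about metaseq's file:

```python
import json

def dict_txt_symbols(path):
    """Parse a fairseq-style dict.txt: one 'symbol count' pair per line.
    (Assumes that format; I haven't confirmed metaseq's file matches it.)"""
    symbols = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            symbol, _, _count = line.rpartition(" ")
            symbols.append(symbol if symbol else line)
    return symbols

def dict_covered_by_vocab(dict_path, vocab_path):
    """Return (True, []) if every dict.txt symbol is a numeric GPT-2 BPE
    token id already present in gpt2-vocab.json, i.e. dict.txt would add
    nothing to an HF tokenizer built from vocab + merges alone."""
    with open(vocab_path, encoding="utf-8") as f:
        vocab_ids = set(json.load(f).values())
    extras = [s for s in dict_txt_symbols(dict_path)
              if not (s.isdigit() and int(s) in vocab_ids)]
    return len(extras) == 0, extras
```

If this reported nothing in `extras` beyond fairseq's special symbols, that would at least suggest dict.txt is redundant with gpt2-vocab.json, but I'd rather hear it confirmed than infer it.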

As an alternative, the smaller OPT models and their tokenizers are available on the HF Hub, so I can just get them from there. Do all the OPT models, including 175B, use the same tokenizer?

If it doesn't make a difference, I could just use the HF tokenizer from one of the smaller models instead. I could easily verify for myself that the smaller models have identical tokenizers by comparing their HF tokenizers across sizes, but that wouldn't necessarily tell me whether 175B uses the same one, since 175B itself isn't on the Hub.
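The comparison I have in mind is simple, assuming I first download each size's tokenizer files (vocab.json and merges.txt) from the Hub; the file names and the file-by-file approach here are my assumption of what a fair comparison looks like, not something from the metaseq docs:

```python
import json

def same_tokenizer_files(vocab_a, merges_a, vocab_b, merges_b):
    """Compare two GPT-2-style tokenizers file by file: identical
    token-to-id vocab mappings and identical ordered merge rules."""
    def load_vocab(path):
        with open(path, encoding="utf-8") as f:
            return json.load(f)

    def load_merges(path):
        with open(path, encoding="utf-8") as f:
            # skip the '#version: ...' header line if present
            return [line for line in f.read().splitlines()
                    if line and not line.startswith("#")]

    return (load_vocab(vocab_a) == load_vocab(vocab_b)
            and load_merges(merges_a) == load_merges(merges_b))
```

But again, even if this returns True for every pair of smaller models, it only tells me they agree with each other, not that 175B matches them.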

mawilson1234 avatar Apr 07 '23 16:04 mawilson1234