Converting OPT-175B tokenizer to HF format?
❓ Questions and Help
What is your question?
I've downloaded the weights for OPT-175B using the URL I got after filling out the Google form. I've also got `dict.txt`, `gpt2-merges.txt`, and `gpt2-vocab.json`. My existing workflow uses the Hugging Face API, so I've converted the weights to HF format using the script here.

However, I'm not sure how to convert the tokenizer to HF format from these files. I see there is a way to build a tokenizer from the `gpt2-merges.txt` and `gpt2-vocab.json` files, but that leaves `dict.txt` unused, which strikes me as likely to cause issues (I can't imagine it would exist if it were not needed). Is there a way to do this?
As an alternative, the smaller OPT models and their tokenizers are available on the HF Hub, so I can just get them from there. Do all the OPT models, including 175B, use the same tokenizer?
If it doesn't make a difference, I could just use the tokenizer from HF for one of the smaller models instead. I could easily verify for myself whether the smaller models have identical tokenizers by comparing the HF tokenizers for the different sizes, but that won't necessarily tell me whether 175B uses the same one, since 175B itself isn't on the Hub.
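For what it's worth, here's a minimal sketch of the comparison I have in mind (assuming two of the smaller checkpoints on the Hub, e.g. `facebook/opt-125m` and `facebook/opt-350m`; any of the public sizes would do):

```python
from transformers import AutoTokenizer

# Load two of the publicly available OPT tokenizers from the Hub.
# use_fast=False is the recommended setting for OPT (see the tips below).
tok_small = AutoTokenizer.from_pretrained("facebook/opt-125m", use_fast=False)
tok_large = AutoTokenizer.from_pretrained("facebook/opt-350m", use_fast=False)

# Identical tokenizers should agree on the vocabulary and on a sample encoding.
assert tok_small.get_vocab() == tok_large.get_vocab()
sample = "Hello, world! This is a tokenizer comparison."
assert tok_small(sample)["input_ids"] == tok_large(sample)["input_ids"]
print("Tokenizers match on vocab and sample encoding.")
```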
After some testing, it appears that the tokenizers on HF are probably the same as the one for OPT-175B (at the very least, my output for a short test made sense when decoded with the tokenizer available on HF for `facebook/opt-125m`). But it'd still be nice to be sure, just in case.
@mawilson1234 I believe you are correct. I used `tokenizer_config.json` and `special_tokens_map.json` from the HF OPT model repo.
Tips (from the HF OPT docs):
- OPT has the same architecture as BartDecoder.
- Contrary to GPT2, OPT adds the EOS token `</s>` to the beginning of every prompt. Note: Make sure to pass `use_fast=False` when loading OPT's tokenizer with [AutoTokenizer](https://huggingface.co/docs/transformers/v4.19.2/en/model_doc/auto#transformers.AutoTokenizer) to get the correct tokenizer. A quick check of that behavior is sketched below.
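A sketch of that check, assuming the `facebook/opt-125m` tokenizer from the Hub:

```python
from transformers import AutoTokenizer

# Load the OPT tokenizer with the slow (Python) implementation, as recommended.
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m", use_fast=False)

ids = tokenizer("Hello world")["input_ids"]
print(ids)                    # the first id should correspond to </s>
print(tokenizer.decode(ids))  # "</s>Hello world"
```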
You can try generating the tokenizer with:

```python
import os
from transformers import GPT2Tokenizer

# model_path: directory containing gpt2-vocab.json and gpt2-merges.txt
vocab_file = os.path.join(model_path, "gpt2-vocab.json")
merges_file = os.path.join(model_path, "gpt2-merges.txt")
tokenizer = GPT2Tokenizer(vocab_file, merges_file)
tokenizer.save_pretrained(model_path)  # writes HF tokenizer files to model_path
```
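After that, and after copying `tokenizer_config.json` and `special_tokens_map.json` from the HF OPT repo into `model_path` as mentioned above, a rough sanity check might look like this (a sketch, assuming those config files are in place so the local tokenizer picks up OPT's special-token behavior):

```python
from transformers import AutoTokenizer

# Reload the tokenizer saved above and compare it against a public OPT tokenizer.
local_tok = AutoTokenizer.from_pretrained(model_path, use_fast=False)
hub_tok = AutoTokenizer.from_pretrained("facebook/opt-125m", use_fast=False)

sample = "Comparing the locally built tokenizer against the Hub one."
print(local_tok(sample)["input_ids"] == hub_tok(sample)["input_ids"])
```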