Why is the "vocab_size" in the config file 50272 while len(tokenizer) is 50265?
🐛 Bug
The "vocab_size" in config file is 50272 but the len(tokenizer) is 50265, they not match eacch other.To Reproduce
Steps to reproduce the behavior (always include the command you ran):
- Run cmd '....'
- See error
Code sample
model.resize_token_embeddings(len(tokenizer))
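One way to see the mismatch directly (a minimal sketch assuming the Hugging Face transformers API; facebook/opt-125m is just an example checkpoint):

from transformers import AutoConfig, AutoTokenizer

config = AutoConfig.from_pretrained('facebook/opt-125m')
tokenizer = AutoTokenizer.from_pretrained('facebook/opt-125m', use_fast=False)
print(config.vocab_size)   # 50272
print(len(tokenizer))      # 50265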
Expected behavior
The results seem good when I use the code above to align the model to the tokenizer, but I just wonder why the vocab size used for training is 50272. Did I miss some important parameter?
Environment
- metaseq Version (e.g., 1.0 or master):
- PyTorch Version (e.g., 1.0)
- OS (e.g., Linux, Windows, MacOS):
- How you installed metaseq (pip, source):
- Build command you used (if compiling from source):
- Python version:
- CUDA/cuDNN version:
- GPU models and configuration:
- Any other relevant information:
Additional context
Tokenizer saved has length 50265 but then we add 4 special tokens: https://github.com/facebookresearch/metaseq/blob/e2df6a021cc5ee024533427ae476ce29cdb65b66/metaseq/tasks/streaming_language_modeling.py#L158 which gives us a dictionary vocab size of 50269 at this point. This is followed by a pad_to_multiple(8): https://github.com/facebookresearch/metaseq/blob/e2df6a021cc5ee024533427ae476ce29cdb65b66/metaseq/tasks/streaming_language_modeling.py#L169, which is why vocab size ends up being 50272.
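For illustration, a minimal sketch of that rounding step (this pad_to_multiple is a hypothetical stand-in, not the actual metaseq implementation):

def pad_to_multiple(size, multiple=8):
    # Round size up to the nearest multiple of `multiple`.
    return ((size + multiple - 1) // multiple) * multiple

print(pad_to_multiple(50265 + 4))  # 50269 rounded up -> 50272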
@suchenzang - Thank you for your answer! It seems that the 4 special tokens are already among the 50265 tokens.
It seems that only the pad_to_multiple(8) step takes the vocab size from 50265 to 50272. What I mean is: are ids 50265-50271 all "madeupword" tokens?
- And does it mean that using model.resize_token_embeddings(len(tokenizer)) has no bad influence? (See the shape check sketch after the code below.)
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('facebook/opt-125m', use_fast=False)
model = AutoModelForCausalLM.from_pretrained('facebook/opt-125m', cache_dir='/ssdwork/cache/').cuda()

# Generate with the original 50272-row embedding matrix.
all_text = 'Which poem is the best one, and please write it to me.'
input_ids = tokenizer(all_text, return_tensors="pt").input_ids.cuda()
outputs = model.generate(input_ids, do_sample=False, max_length=256, num_beams=1)
output_decode = tokenizer.batch_decode(outputs, skip_special_tokens=True)
output_decode = output_decode[0]
print(output_decode)  # the result: Which poem is the best one, and please write it to me.\nI'm not sure, but I think it's the one by the author of the poem.

# Resize the embeddings to len(tokenizer) == 50265 and generate again.
model.resize_token_embeddings(len(tokenizer))
all_text = 'Which poem is the best one, and please write it to me.'
input_ids = tokenizer(all_text, return_tensors="pt").input_ids.cuda()
outputs = model.generate(input_ids, do_sample=False, max_length=256, num_beams=1)
output_decode = tokenizer.batch_decode(outputs, skip_special_tokens=True)
output_decode = output_decode[0]
print(output_decode)  # the result: Which poem is the best one, and please write it to me.\nI'm not sure, but I think it's the one by the author of the poem.
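For what it's worth, a minimal sketch (assuming the Hugging Face transformers API) to check that resizing only drops the trailing rows of the embedding matrix; the shapes in the comments are what I would expect for opt-125m, not verified here:

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('facebook/opt-125m', use_fast=False)
model = AutoModelForCausalLM.from_pretrained('facebook/opt-125m')

print(model.get_input_embeddings().weight.shape)  # expected roughly [50272, 768] before resizing
model.resize_token_embeddings(len(tokenizer))     # shrink to len(tokenizer) == 50265
print(model.get_input_embeddings().weight.shape)  # expected roughly [50265, 768] after resizing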
I have the same question. Also, is it OK to use a RoBERTa tokenizer instead?
Same question. Will it cause an IndexError?
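It should not, as long as the tokenizer never emits an id at or above len(tokenizer). A quick sanity check (my own sketch, not from the repo):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('facebook/opt-125m', use_fast=False)
ids = tokenizer("Which poem is the best one, and please write it to me.")["input_ids"]
print(max(ids), len(tokenizer))  # the max id should stay below len(tokenizer) == 50265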