Why is the "vocab_size" in the config file 50272 while len(tokenizer) is 50265?
🐛 Bug
The "vocab_size" in config file is 50272 but the len(tokenizer) is 50265, they not match eacch other.To Reproduce
Steps to reproduce the behavior (always include the command you ran):
- Run cmd '....'
- See error
Code sample
model.resize_token_embeddings(len(tokenizer))
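One way to see the mismatch directly (a minimal sketch assuming the Hugging Face transformers API; facebook/opt-125m is just an example checkpoint):

from transformers import AutoConfig, AutoTokenizer

config = AutoConfig.from_pretrained('facebook/opt-125m')
tokenizer = AutoTokenizer.from_pretrained('facebook/opt-125m', use_fast=False)
print(config.vocab_size)   # 50272
print(len(tokenizer))      # 50265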
Expected behavior
The results seem good when I use the code above to align the model to the tokenizer, but I just wonder why the vocab size used for training is 50272. Did I miss some important parameter?
Environment
- metaseq Version (e.g., 1.0 or master):
- PyTorch Version (e.g., 1.0)
- OS (e.g., Linux, Windows, MacOS):
- How you installed metaseq (pip, source):
- Build command you used (if compiling from source):
- Python version:
- CUDA/cuDNN version:
- GPU models and configuration:
- Any other relevant information:
Additional context
Tokenizer saved has length 50265 but then we add 4 special tokens: https://github.com/facebookresearch/metaseq/blob/e2df6a021cc5ee024533427ae476ce29cdb65b66/metaseq/tasks/streaming_language_modeling.py#L158 which gives us a dictionary vocab size of 50269 at this point. This is followed by a pad_to_multiple(8): https://github.com/facebookresearch/metaseq/blob/e2df6a021cc5ee024533427ae476ce29cdb65b66/metaseq/tasks/streaming_language_modeling.py#L169, which is why vocab size ends up being 50272.
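For illustration, a minimal sketch of that rounding step (this pad_to_multiple is a hypothetical stand-in, not the actual metaseq implementation):

def pad_to_multiple(size, multiple=8):
    # Round size up to the nearest multiple of `multiple`.
    return ((size + multiple - 1) // multiple) * multiple

print(pad_to_multiple(50265 + 4))  # 50269 rounded up -> 50272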
@suchenzang - Thank you for your answer! It seems that the 4 special tokens are already among the 50265 tokens.
It seems that only the pad_to_multiple(8) step takes the vocab size from 50265 to 50272. What I mean is: are ids 50265-50271 all "madeupword" tokens?
- And does it mean that using model.resize_token_embeddings(len(tokenizer)) has no bad influence? (See the shape check sketch after the code below.)
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('facebook/opt-125m', use_fast=False)
model = AutoModelForCausalLM.from_pretrained('facebook/opt-125m', cache_dir='/ssdwork/cache/').cuda()

# Generate with the original 50272-row embedding matrix.
all_text = 'Which poem is the best one, and please write it to me.'
input_ids = tokenizer(all_text, return_tensors="pt").input_ids.cuda()
outputs = model.generate(input_ids, do_sample=False, max_length=256, num_beams=1)
output_decode = tokenizer.batch_decode(outputs, skip_special_tokens=True)
output_decode = output_decode[0]
print(output_decode)  # the result: Which poem is the best one, and please write it to me.\nI'm not sure, but I think it's the one by the author of the poem.

# Resize the embeddings to len(tokenizer) == 50265 and generate again.
model.resize_token_embeddings(len(tokenizer))
all_text = 'Which poem is the best one, and please write it to me.'
input_ids = tokenizer(all_text, return_tensors="pt").input_ids.cuda()
outputs = model.generate(input_ids, do_sample=False, max_length=256, num_beams=1)
output_decode = tokenizer.batch_decode(outputs, skip_special_tokens=True)
output_decode = output_decode[0]
print(output_decode)  # the result: Which poem is the best one, and please write it to me.\nI'm not sure, but I think it's the one by the author of the poem.
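For what it's worth, a minimal sketch (assuming the Hugging Face transformers API) to check that resizing only drops the trailing rows of the embedding matrix; the shapes in the comments are what I would expect for opt-125m, not verified here:

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('facebook/opt-125m', use_fast=False)
model = AutoModelForCausalLM.from_pretrained('facebook/opt-125m')

print(model.get_input_embeddings().weight.shape)  # expected roughly [50272, 768] before resizing
model.resize_token_embeddings(len(tokenizer))     # shrink to len(tokenizer) == 50265
print(model.get_input_embeddings().weight.shape)  # expected roughly [50265, 768] after resizing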
I have the same question. Also, is it OK to use a RoBERTa tokenizer instead?
Same question. Will it cause an IndexError?
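It should not, as long as the tokenizer never emits an id at or above len(tokenizer). A quick sanity check (my own sketch, not from the repo):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('facebook/opt-125m', use_fast=False)
ids = tokenizer("Which poem is the best one, and please write it to me.")["input_ids"]
print(max(ids), len(tokenizer))  # the max id should stay below len(tokenizer) == 50265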