fairseq
NLLB inference time model loading fails due to inconsistent vocabulary size
Hi, I downloaded the dictionary and the 600M NLLB-200-Distilled checkpoint. I failed to load the model weights from the checkpoint due to an inconsistent vocabulary size.
The dictionary has 255997 tokens and the model supports 202 languages, so 202 extra language tokens are added. The total number of tokens in the embedding matrix should therefore be 255997 + 202 + 4 = 256203, where 4 stands for the special symbols. However, the embedding matrix in the checkpoint file has 256206 rows, making it incompatible with the model structure built at evaluation time. Since 256206 is not divisible by 8, the extra 3 symbols cannot be dummy padding words either.
I'm wondering what the cause of the problem is.
Here is the message I got:
RuntimeError: Error(s) in loading state_dict for TransformerModel:
size mismatch for encoder.embed_tokens.weight: copying a param with shape torch.Size([256206, 1024]) from checkpoint, the shape in current model is torch.Size([256203, 1024]).
size mismatch for decoder.embed_tokens.weight: copying a param with shape torch.Size([256206, 1024]) from checkpoint, the shape in current model is torch.Size([256203, 1024]).
size mismatch for decoder.output_projection.weight: copying a param with shape torch.Size([256206, 1024]) from checkpoint, the shape in current model is torch.Size([256203, 1024]).
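A quick way to reproduce the arithmetic above and compare it against the checkpoint (just a sketch; the file names are placeholders for the downloaded dictionary and checkpoint):

```python
import torch

# Expected vocabulary size derived from the released dictionary
dict_entries = sum(1 for _ in open("dictionary.txt", encoding="utf-8"))  # 255997
num_langs = 202      # one language token per supported language
num_specials = 4     # <s>, <pad>, </s>, <unk> added by fairseq's Dictionary
print("expected rows:", dict_entries + num_langs + num_specials)  # 256203

# Actual embedding size stored in the checkpoint
state = torch.load("nllb200_densedist_600m.pt", map_location="cpu")
print("checkpoint rows:", state["model"]["encoder.embed_tokens.weight"].shape[0])  # 256206
```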
There are three additional tokens in the vocabulary that we add during training. Here is a response related to this:
https://github.com/huggingface/transformers/issues/18043#issuecomment-1179317930
More specifically, here are the additional fine-grained data-source tokens we add:
https://github.com/facebookresearch/fairseq/blob/nllb/fairseq/data/multilingual/multilingual_utils.py#L14-L18
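With those three tokens included, the sizes line up. A small sketch of the accounting (the actual tag strings are defined in the multilingual_utils.py link above):

```python
# Breakdown of the 256206 embedding rows in the NLLB-200 checkpoint
spm_entries = 255997     # entries in the released dictionary file
special_symbols = 4      # <s>, <pad>, </s>, <unk>
language_tags = 202      # one tag per NLLB-200 language
data_source_tags = 3     # fine-grained data-source tags added during training
print(spm_entries + special_symbols + language_tags + data_source_tags)  # 256206
```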
Thank you, that works. But the response you referred to says that there are 4 extra tokens prepended to the vocabulary rather than 3. Is that still a mistake? Besides, the internal Fairseq dictionary link he provided is inaccessible, so I can't check the implementation.
@pluiez In that post, jhcross means that what stefan-it wrote, namely
0 -> <unk>
1 -> <s>
2 -> </s>
is wrong. In the fairseq dictionary, it is
0 -> <s>
1 -> <pad>
2 -> </s>
3 -> <unk>
So an SPM model loaded directly into a Hugging Face tokenizer will produce wrong input IDs. The link probably refers to the regular dictionary.py. Hope to see all of your great success.
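If it helps, the order of the special symbols can be checked directly with fairseq's Dictionary class (a minimal sketch, assuming fairseq is installed):

```python
from fairseq.data import Dictionary

d = Dictionary()  # the constructor adds <s>, <pad>, </s>, <unk> in this order
for sym in ("<s>", "<pad>", "</s>", "<unk>"):
    print(d.index(sym), sym)
# 0 <s>
# 1 <pad>
# 2 </s>
# 3 <unk>
```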
@gmryu I really appreciate your explanation; it's all clear to me now.
Hello, how did you solve this problem at inference time?
@wenHK could you make sure you are passing the flag --add-data-source-prefix-tags at inference time?
There is no --add-data-source-prefix-tags option in the latest fairseq version. I ran into a similar problem.
When I pass --add-data-source-prefix-tags, I get: fairseq-interactive: error: unrecognized arguments: --add-data-source-prefix-tags
@gdxie1 This option is only available in the nllb branch, so you need to check out that branch to run inference.