
NLLB inference time model loading fails due to inconsistent vocabulary size

Open pluiez opened this issue 1 year ago • 8 comments

Hi, I downloaded the dictionary and the 600M NLLB-200-Distilled checkpoint. I failed to load the model weights from the checkpoint due to an inconsistent vocabulary size.

The dictionary has 255997 tokens and the model supports 202 languages, so there are an extra 202 language tokens. The total number of tokens in the embedding matrix should therefore be 255997 + 202 + 4 = 256203, where 4 stands for the special symbols. However, the embedding matrix in the checkpoint file has size 256206, making it incompatible with the model structure built at evaluation time. The number 256206 is not divisible by 8, so the extra 3 entries cannot be dummy madeupword fillers either.
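
For reference, a minimal sketch of the size check, without building the full model (the file paths are placeholders for the downloaded checkpoint and dictionary; fairseq checkpoints normally keep the weights under the "model" key):

```python
# Hedged sketch: compare the checkpoint's embedding size with the expected
# vocabulary size. Paths below are placeholders, not the real file names.
import torch

ckpt = torch.load("checkpoint.pt", map_location="cpu")   # 600M NLLB-200-Distilled checkpoint
emb_rows = ckpt["model"]["encoder.embed_tokens.weight"].shape[0]

with open("dictionary.txt", encoding="utf-8") as f:      # downloaded dictionary file
    dict_entries = sum(1 for _ in f)                     # 255997 entries, one token per line

print(emb_rows)                # 256206 in the checkpoint
print(dict_entries + 202 + 4)  # 255997 + 202 language tokens + 4 specials = 256203
```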

I'm wondering what's the cause of the problem.

Here is the message I got:

RuntimeError: Error(s) in loading state_dict for TransformerModel:
        size mismatch for encoder.embed_tokens.weight: copying a param with shape torch.Size([256206, 1024]) from checkpoint, the shape in current model is torch.Size([256203, 1024]).
        size mismatch for decoder.embed_tokens.weight: copying a param with shape torch.Size([256206, 1024]) from checkpoint, the shape in current model is torch.Size([256203, 1024]).
        size mismatch for decoder.output_projection.weight: copying a param with shape torch.Size([256206, 1024]) from checkpoint, the shape in current model is torch.Size([256203, 1024]).

pluiez · Jul 08 '22 16:07

There are three additional tokens in the vocabulary that we add during training. Here is a response related to this:

https://github.com/huggingface/transformers/issues/18043#issuecomment-1179317930

More specifically, here are the additional fine-grained data-source tokens we add:

https://github.com/facebookresearch/fairseq/blob/nllb/fairseq/data/multilingual/multilingual_utils.py#L14-L18
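
To make the accounting concrete, a minimal sketch (the per-group counts are the ones quoted in this thread; the language-tag format in the comment is only illustrative):

```python
# Sketch of the vocabulary accounting for the NLLB-200 checkpoint.
# All counts come from this thread; nothing is read from disk.
dict_entries = 255_997   # entries listed in the downloaded dictionary file
specials = 4             # <s>, <pad>, </s>, <unk>, prepended by fairseq's Dictionary
language_tokens = 202    # one token per supported language (e.g. __eng_Latn__)
data_source_tags = 3     # fine-grained data-source tokens added during training

print(dict_entries + specials + language_tokens + data_source_tags)  # 256206
```

This matches the 256206 rows of encoder.embed_tokens.weight in the checkpoint.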

vedanuj · Jul 08 '22 20:07

Thank you, that works. But the response you referred to says that there are 4 extra tokens prepended to the vocabulary rather than 3. Is that a mistake? Also, the internal fairseq dictionary link he provided is inaccessible, so I can't check the implementation.

pluiez · Jul 09 '22 03:07

@pluiez In that post, jhcross means that what stefan-it wrote, i.e.

0 -> <unk>
1 -> <s>
2 -> </s>

is wrong. In the fairseq dictionary it is

0 -> <s>
1 -> <pad>
2 -> </s>
3 -> <unk>

So an SPM model loaded directly into a Hugging Face tokenizer will produce wrong input ids. The link probably refers to the normal dictionary.py. Hope you all have great success.
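
To illustrate the shift, a rough sketch (not fairseq's actual code; it only assumes the dictionary file lists the SentencePiece pieces in order after fairseq's four reserved symbols, while SentencePiece keeps its own three specials up front):

```python
# Hedged sketch of why reusing raw SentencePiece ids gives wrong fairseq ids:
# the two vocabularies order their special symbols differently, so every
# ordinary piece is shifted.

spm_specials = {0: "<unk>", 1: "<s>", 2: "</s>"}                  # SentencePiece defaults
fairseq_specials = {0: "<s>", 1: "<pad>", 2: "</s>", 3: "<unk>"}  # reserved by fairseq's Dictionary

def spm_id_to_fairseq_id(spm_id: int) -> int:
    """Map an ordinary (non-special) SentencePiece piece id to its fairseq
    dictionary id under the assumptions above: the pieces shift by one."""
    return spm_id - len(spm_specials) + len(fairseq_specials)  # i.e. spm_id + 1
```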

gmryu · Jul 09 '22 13:07

@gmryu Really appreciate your explanation; it's all clear to me now.

pluiez · Jul 09 '22 13:07

Hi, I downloaded the dictionary and 600M NLLB-200-Distilled checkpoint. I failed to load model weights from the checkpoint due to inconsistent vocabulary size. [...]

Hello, how did you solve this problem at inference time?

wenHK · Aug 09 '22 07:08

@wenHK could you make sure you are passing the flag --add-data-source-prefix-tags at inference time?

annasun28 · Aug 09 '22 16:08

There is no --add-data-source-prefix-tags in the latest fairseq version. I ran into a similar problem.

gdxie1 · Nov 15 '22 18:11

@wenHK could you make sure you are passing the flag --add-data-source-prefix-tags at inference time?

fairseq-interactive: error: unrecognized arguments: --add-data-source-prefix-tags

gdxie1 · Nov 15 '22 20:11

@gdxie1 This option is only available on the nllb branch, so you need to check out that branch to run inference.

pluiez · Dec 28 '22 07:12