RuntimeError: Error(s) in loading state_dict for TransformerModel
🐛 Bug
To Reproduce
Steps to reproduce the behavior (always include the command you ran):
1. Hi, whenever I download the entire checkpoint folder from the virtual machine server onto my personal computer, or even transfer a checkpoint from one VM to another, I encounter the following error:
RuntimeError: Error(s) in loading state_dict for TransformerModel: Unexpected key(s) in state_dict: "encoder.layers.0.in_proj_weight", "encoder.layers.0.in_proj_bias", "encoder.layers.1.in_proj_weight", "encoder.layers.1.in_proj_bias", "encoder.layers.2.in_proj_weight", "encoder.layers.2.in_proj_bias", "encoder.layers.3.in_proj_weight", "encoder.layers.3.in_proj_bias", "encoder.layers.4.in_proj_weight", "encoder.layers.4.in_proj_bias", "encoder.layers.5.in_proj_weight", "encoder.layers.5.in_proj_bias".
I have trained several different models and hit this issue every time: whenever I transfer a checkpoint from one VM to another, it fails to load on the new VM.
Interestingly, the checkpoint loads without any issues on the machine it was trained on, but not on any other device (another VM, a local computer, or even GCP). Do you have any suggestions?
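For reference, here is a minimal sketch I would use to inspect which keys the transferred checkpoint actually contains on the target machine (the path is a placeholder; this assumes the standard fairseq checkpoint layout with the weights stored under ckpt["model"]):
import torch

# Load only the raw checkpoint dict, without building the model.
ckpt = torch.load("checkpoint_last.pt", map_location="cpu")
# Print a few parameter keys to compare against the keys the error reports.
for key in sorted(ckpt["model"].keys())[:10]:
    print(key)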
This is probably due to a different fairseq version being used during training and loading; you may have an older version installed locally.
Please check the two fairseq versions.
If they are the same, please share the exact fairseq-train command with all its parameters so we can reproduce.
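For example, a quick check that can be run on both machines (assuming both packages import cleanly):
import fairseq
import torch

print("fairseq:", fairseq.__version__)
print("torch:", torch.__version__)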
Thanks for your response. I have checked the fairseq version: __version__ = "0.12.2" on both machines. I have also checked the PyTorch version, and it is 1.12.1+cu102 on both as well.
Currently I am loading the trained model; here is the relevant part of the script:
from fairseq.models.transformer import TransformerModel

pd2en_large_model_path = "/home/mohammed_yahia3/models/"
pd2en_small_model_path = "/home/CE/musaeed/kd-distiller/checkpoints/"
pd2en_small_model_preprocess = "/home/CE/musaeed/FAKE_pd_en.tokenized.pd-en"
pd2en_large_model_preprocess = "/home/mohammed_yahia3/models/BT_pd_en.tokenized.pd-en"

pd2en = TransformerModel.from_pretrained(
    pd2en_large_model_path,
    checkpoint_file="checkpoint_last.pt",
    data_name_or_path=pd2en_large_model_preprocess,
    bpe="sentencepiece",
    sentencepiece_model="/home/mohammed_yahia3/models/pd__vocab_4000.model",
)
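If loading succeeds, the returned hub interface can be sanity-checked with a single translation; a minimal example (the input sentence is a placeholder):
pd2en.eval()  # disable dropout for inference
print(pd2en.translate("a sample source sentence"))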
Hello! I have the same problem.
Environment:
torch 1.9.1+cu111
Python 3.8.8
Linux: Ubuntu 20.04.5 LTS
Installed fairseq via pip (pip install --editable ./)
(base) root@x517b-task0: /fairseq# git branch -r
origin/0.12.2-release
origin/0.12.3-release
origin/HEAD -> origin/main
origin/adaptor_pad_fix
origin/adding_womenbios
run.sh
CUDA_VISIBLE_DEVICES=0,1,2,3 fairseq-train \
data-bin/wmt16_en_ro_first_hf \
--ddp-backend=legacy_ddp \
--arch transformer_wmt_en_de --share-decoder-input-output-embed -s 'en' -t 'ro' \
--optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
--lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 --max-epoch 2 \
--dropout 0.3 --weight-decay 0.0001 \
--criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
--eval-bleu \
--eval-bleu-args '{"beam": 5, "max_len_a": 1.2, "max_len_b": 10}' \
--eval-bleu-detok moses \
--eval-bleu-remove-bpe \
--eval-bleu-print-samples \
--best-checkpoint-metric bleu --maximize-best-checkpoint-metric \
--log-file ./log/wmt16_en_ro_first_hf_log.txt \
--scoring sacrebleu \
--max-tokens 8192 \
--save-dir checkpoints/wmt16_en_ro_first_hf/transformer_wmt_en_de \
--no-epoch-checkpoints \
--memory-efficient-fp16 \
--distributed-world-size 4 \
--nprocs-per-node 4
# evaluate
CUDA_VISIBLE_DEVICES=0,1,2,3 fairseq-generate data-bin/wmt16_en_ro_first_hf \
--path checkpoints/wmt16_en_ro_first_hf/transformer_wmt_en_de/checkpoint_last.pt \
--batch-size 128 --beam 5 --remove-bpe
I use the commands above to obtain checkpoint_last.pt; however, when I load the checkpoint, the problem appears:
checkpoint = BaseFairseqModel.from_pretrained('checkpoints/wmt16_en_ro_first_hf/transformer_wmt_en_de', checkpoint_file="checkpoint_last.pt")
error info:
*** RuntimeError: Error(s) in loading state_dict for TransformerModel:
Unexpected key(s) in state_dict: "encoder.layers.0.in_proj_weight", "encoder.layers.0.in_proj_bias", "encoder.layers.0.out_proj_weight", "encoder.layers.0.out_proj_bias", "encoder.layers.0.fc1_weight", "encoder.layers.0.fc1_bias", "encoder.layers.0.fc2_weight", "encoder.layers.0.fc2_bias", "encoder.layers.1.in_proj_weight", "encoder.layers.1.in_proj_bias", "encoder.layers.1.out_proj_weight", "encoder.layers.1.out_proj_bias", "encoder.layers.1.fc1_weight", "encoder.layers.1.fc1_bias", "encoder.layers.1.fc2_weight", "encoder.layers.1.fc2_bias", "encoder.layers.2.in_proj_weight", "encoder.layers.2.in_proj_bias", "encoder.layers.2.out_proj_weight", "encoder.layers.2.out_proj_bias", "encoder.layers.2.fc1_weight", "encoder.layers.2.fc1_bias", "encoder.layers.2.fc2_weight", "encoder.layers.2.fc2_bias", "encoder.layers.3.in_proj_weight", "encoder.layers.3.in_proj_bias", "encoder.layers.3.out_proj_weight", "encoder.layers.3.out_proj_bias", "encoder.layers.3.fc1_weight", "encoder.layers.3.fc1_bias", "encoder.layers.3.fc2_weight", "encoder.layers.3.fc2_bias", "encoder.layers.4.in_proj_weight", "encoder.layers.4.in_proj_bias", "encoder.layers.4.out_proj_weight", "encoder.layers.4.out_proj_bias", "encoder.layers.4.fc1_weight", "encoder.layers.4.fc1_bias", "encoder.layers.4.fc2_weight", "encoder.layers.4.fc2_bias", "encoder.layers.5.in_proj_weight", "encoder.layers.5.in_proj_bias", "encoder.layers.5.out_proj_weight", "encoder.layers.5.out_proj_bias", "encoder.layers.5.fc1_weight", "encoder.layers.5.fc1_bias", "encoder.layers.5.fc2_weight", "encoder.layers.5.fc2_bias".
It seems encoder.layers.0.in_proj_weight should be encoder.layers.0.in_proj.weight.
And here is what happens when I use torch.load():
(Pdb) pt =torch.load("checkpoints/wmt16_en_ro_first_hf/transformer_wmt_en_de/checkpoint_last.pt")
(Pdb) pt['model']["encoder.layers.0.in_proj_weight"]
tensor([[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]])
(Pdb) pt['model']["encoder.layers.0.in_proj.weight"]
*** KeyError: 'encoder.layers.0.in_proj.weight'
I trained checkpoint_last.pt myself, but these params are all zero. This is weird.
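To see how widespread the zeroed parameters are, a small sketch (same checkpoint path as above) that lists every parameter tensor that is entirely zero:
import torch

ckpt = torch.load(
    "checkpoints/wmt16_en_ro_first_hf/transformer_wmt_en_de/checkpoint_last.pt",
    map_location="cpu",
)
# Collect every parameter tensor whose entries are all zero.
zero_keys = [
    k for k, v in ckpt["model"].items()
    if torch.is_tensor(v) and v.numel() > 0 and not v.any()
]
print(zero_keys)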
I would appreciate help understanding:
- why params like encoder.layers.0.in_proj_weight become encoder.layers.0.in_proj.weight, and
- why params like encoder.layers.0.in_proj_weight are all zero after training.
Have you solved this problem already? I'm still testing my environment by using an older version of fairseq (0.11.1) and running the demo de-en tasks. Please let me know if you solved it.
I ran into the same error when training a model on Colab and launching fairseq-generate on my machine. It is pretty absurd that fairseq, which boasts that its models are plain PyTorch objects, cannot load them on another machine. Ironically enough, if I convert the model with CTranslate2 it works just fine. Paradoxically, I can use a fairseq-trained model with CTranslate2 but not with fairseq itself. Almost unbelievable.
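For reference, the CTranslate2 conversion is along these lines (all paths are placeholders):
ct2-fairseq-converter --model_path checkpoint_last.pt \
    --data_dir data-bin/my_dataset \
    --output_dir ct2_model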
I also hit this bug. I printed out the state_dict of a workable checkpoint and of an unworkable checkpoint from fairseq:
orig_bart_large_state_dict.txt
after_fairseq_train_state_dict.txt
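(A key dump like those attachments can be produced with a few lines; the paths here are placeholders:)
import torch

ckpt = torch.load("checkpoint.pt", map_location="cpu")
with open("state_dict_keys.txt", "w") as f:
    f.write("\n".join(ckpt["model"].keys()))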
I found that the checkpoint produced by fairseq training contains some unexpected extra keys:
ignore_keys = []
# Collect the unexpected flat attention/FFN keys for all 12 encoder layers.
for l in range(0, 11 + 1):
    ignore_keys.append(f'encoder.layers.{l}.in_proj_weight')
    ignore_keys.append(f'encoder.layers.{l}.in_proj_bias')
    ignore_keys.append(f'encoder.layers.{l}.out_proj_weight')
    ignore_keys.append(f'encoder.layers.{l}.out_proj_bias')
    ignore_keys.append(f'encoder.layers.{l}.fc1_weight')
    ignore_keys.append(f'encoder.layers.{l}.fc1_bias')
    ignore_keys.append(f'encoder.layers.{l}.fc2_weight')
    ignore_keys.append(f'encoder.layers.{l}.fc2_bias')
I pop them from the state_dict and it works, as sketched below.
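A rough sketch of the full workaround, building on the ignore_keys list above (the checkpoint path is a placeholder; this writes a cleaned copy rather than overwriting the original):
import torch

ckpt_path = "checkpoint_last.pt"
ckpt = torch.load(ckpt_path, map_location="cpu")
# Drop the flat duplicate keys; in my checkpoint these tensors were all zero,
# so nothing meaningful is lost.
for k in ignore_keys:
    ckpt["model"].pop(k, None)
torch.save(ckpt, "checkpoint_last_clean.pt")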
However, I didn't dig deeper into why this happens during fairseq's training process.
I'm not sure this fix is entirely safe, but the inference scores are acceptable.