
Error when fine-tuning from someone else's model.bin: File "~/fairseq/fairseq/checkpoint_utils.py", line 581, in _upgrade_state_dict {"criterion_name": "CrossEntropyCriterion", "best_loss": state["best_loss"]} KeyError: 'best_loss'

Open muhammed-saeed opened this issue 1 year ago • 3 comments

🐛 Bug

Hi, I am trying to run additional pre-training epochs of the pretrained RoBERTa model on a low-resource English-based language. The model I want to use is https://huggingface.co/roberta-base/tree/main. I downloaded pytorch_model.bin, passed it as the checkpoint path, and then ran fairseq-train as shown below, but I always hit the same error:

    File "~/fairseq/fairseq/checkpoint_utils.py", line 581, in _upgrade_state_dict
        {"criterion_name": "CrossEntropyCriterion", "best_loss": state["best_loss"]}
    KeyError: 'best_loss'

To Reproduce

Steps to reproduce the behavior (always include the command you ran):

  1. Run cmd '....'
  2. See error

Code sample

TOTAL_NUM_UPDATES=7812   # 10 epochs through IMDB for bsz 32
WARMUP_UPDATES=469       # 6 percent of the number of updates
LR=1e-05                 # Peak LR for polynomial LR scheduler.
HEAD_NAME="pcm_head_more_fine_tuning_roberat_base"   # Custom name for the classification head.
NUM_CLASSES=3            # Number of classes for the classification task.
MAX_SENTENCES=8          # Batch size.
ROBERTA_PATH="/home/CE/musaeed/ironside_nmt/ROBERTA-base-en/config_files/pytorch_model.bin"

ROBERTA_PATH = "/content/checkpoint_best.pt"

CUDA_VISIBLE_DEVICES=0

fairseq-train /home/CE/musaeed/ironside_nmt/ROBERTA-base-en/checher_pcm-bin/ \

fairseq-train /home/CE/musaeed/ironside_nmt/ironside_roberta/pcm_roberta_fairseq/data-bin/pcm-bin/ \
    --restore-file $ROBERTA_PATH \
    --max-positions 514 \
    --batch-size $MAX_SENTENCES \
    --max-tokens 4400 \
    --task sentence_prediction \
    --reset-optimizer --reset-dataloader --reset-meters \
    --required-batch-size-multiple 1 \
    --init-token 0 --separator-token 2 \
    --arch roberta \
    --criterion sentence_prediction \
    --classification-head-name $HEAD_NAME \
    --num-classes $NUM_CLASSES \
    --dropout 0.1 --attention-dropout 0.1 \
    --weight-decay 0.1 --optimizer adam --adam-betas "(0.9, 0.98)" --adam-eps 1e-06 \
    --clip-norm 0.0 \
    --lr-scheduler polynomial_decay --lr $LR --total-num-update $TOTAL_NUM_UPDATES --warmup-updates $WARMUP_UPDATES \
    --fp16 --fp16-init-scale 4 --threshold-loss-scale 1 --fp16-scale-window 128 \
    --max-epoch 10 \
    --best-checkpoint-metric accuracy --maximize-best-checkpoint-metric \
    --shorten-method "truncate" \
    --find-unused-parameters \
    --update-freq 4 \
    --wandb-project "fine-tuing ROBERTa-en-base"

Expected behavior

Environment

  • fairseq Version (e.g., 1.0 or main):
  • PyTorch Version (e.g., 1.0)
  • OS (e.g., Linux):
  • How you installed fairseq (pip, source):
  • Build command you used (if compiling from source):
  • Python version:
  • CUDA/cuDNN version:
  • GPU models and configuration:
  • Any other relevant information:

Additional context

muhammed-saeed · Jul 06 '22

Any update? @muhammed-saeed

1029694141 · Aug 30 '22

Yup, I managed to solve the issue. I realized the alignment file was named train.bpe.align, but when I renamed it to train.align it worked. I think it depends on the suffix you pass to fairseq-preprocess; in my case the suffix was align, so I think that is why the file has to be named this way.
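For reference, a minimal sketch of that rename (the directory and split names are assumptions; only the train.bpe.align -> train.align rename comes from this comment):

    # Hypothetical rename so the alignment files carry the suffix that was passed
    # to fairseq-preprocess ("align" in this case), e.g. train.align.
    from pathlib import Path

    data_dir = Path("data-bin/pcm-bin")  # assumed preprocessed-data directory
    for split in ("train", "valid", "test"):
        old = data_dir / f"{split}.bpe.align"
        if old.exists():
            old.rename(data_dir / f"{split}.align")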

muhammed-saeed · Sep 02 '22

Hi, I came across a similar issue before. I think it's caused by the difference in checkpoint format: the checkpoint file saved by fairseq is a dict with keys such as 'args', 'cfg', 'model', 'last_optimizer_state' and so on, while the checkpoint file downloaded from Hugging Face is an OrderedDict that corresponds only to the 'model' entry of a fairseq checkpoint. So I solved it by adding some 'fake' params to the Hugging Face checkpoint file to match what the code expects, because what really matters is the model params.
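For illustration, a minimal sketch of that "fake params" idea, assuming the Hugging Face file is a flat OrderedDict of tensors; the file names and placeholder values are assumptions, not the exact code used here, and it only adds the bookkeeping keys, not an 'args'/'cfg' entry:

    # Hypothetical conversion: wrap the flat Hugging Face state_dict in the nested
    # layout a fairseq checkpoint uses, with placeholders for everything but "model".
    import torch

    hf_state = torch.load("pytorch_model.bin", map_location="cpu")  # flat OrderedDict

    fairseq_style = {
        "model": hf_state,                 # the only part that actually matters
        "best_loss": 2.0,                  # placeholder values below are arbitrary
        "extra_state": {"epoch": 0},
        "optimizer_history": [
            {
                "criterion_name": "CrossEntropyCriterion",
                "optimizer_name": "FairseqNAG",
                "lr_scheduler_state": {"best": 2.0},
                "num_updates": 0,
            }
        ],
        "last_optimizer_state": {},
    }

    torch.save(fairseq_style, "roberta_fairseq_style.pt")  # then pass via --restore-file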

XueMoonLit · Dec 05 '22


@XueMoonLit Do you have a code snippet of how you achieved this?

KiriKoppelgaard · Mar 10 '23


@XueMoonLit Do you have a code snippet of how you achieved this?

Hi, I modified the function "_upgrade_state_dict" in checkpoint_utils.py, adding some keys to "state" according to the error logs. Maybe removing the call to this function from "load_checkpoint_to_cpu" in the same file would also work.

    def _upgrade_state_dict(state):
        """Helper for upgrading old model checkpoints."""
        if "best_loss" not in state:
            state["best_loss"] = 2.0
        if "extra_state" not in state:
            state["extra_state"] = {"epoch": 0}

        # add optimizer_history
        if "optimizer_history" not in state:
            state["optimizer_history"] = [
                {"criterion_name": "LabelSmoothedCrossEntropyCriterion", "best_loss": state["best_loss"]}
            ]
        # reduce optimizer history's memory usage (only keep the last state)
        if "optimizer" in state["optimizer_history"][-1]:
            state["last_optimizer_state"] = state["optimizer_history"][-1]["optimizer"]
            for optim_hist in state["optimizer_history"]:
                del optim_hist["optimizer"]
        # record the optimizer class name
        if "optimizer_name" not in state["optimizer_history"][-1]:
            state["optimizer_history"][-1]["optimizer_name"] = "FairseqNAG"
        # move best_loss into lr_scheduler_state
        if "lr_scheduler_state" not in state["optimizer_history"][-1]:
            state["optimizer_history"][-1]["lr_scheduler_state"] = {
                "best": state["optimizer_history"][-1]["best_loss"]
            }
            del state["optimizer_history"][-1]["best_loss"]
        # keep track of number of updates
        if "num_updates" not in state["optimizer_history"][-1]:
            state["optimizer_history"][-1]["num_updates"] = 0
        # use stateful training data iterator
        if "train_iterator" not in state["extra_state"]:
            state["extra_state"]["train_iterator"] = {
                "epoch": state["extra_state"].get("epoch", 0),
            }
        # ... rest of the original function unchanged
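As a quick sanity check (a sketch, not from the thread; the path is an example), loading the raw Hugging Face file through fairseq should then get past the KeyError: 'best_loss' stage:

    # Hypothetical check after patching checkpoint_utils.py: the loader should no
    # longer fail on the missing 'best_loss' key for a flat Hugging Face state_dict.
    from fairseq.checkpoint_utils import load_checkpoint_to_cpu

    state = load_checkpoint_to_cpu("pytorch_model.bin")  # example path
    print(sorted(state.keys()))  # weight names plus the injected bookkeeping keys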

XueMoonLit · Mar 10 '23

@XueMoonLit Ah, sure. Thanks :)

Didn't you get the following error after adding the dummy keys?

    File "/home/kikop/miniconda3/envs/xls-r/lib/python3.10/site-packages/fairseq/models/wav2vec/wav2vec2_asr.py", line 381, in __init__
        w2v_args = convert_namespace_to_omegaconf(state["args"])
    KeyError: 'args'

I guess the Hugging Face OrderedDict still isn't saved under the key 'cfg' or 'args'. Did you also find a way to circumvent this?

KiriKoppelgaard · Mar 14 '23

Maybe it's somehow related to the way the state_dict is stored.

muhammed-saeed · Mar 26 '23