FastChat
Continue training from checkpoint raises RuntimeError
Duplicate of #540
transformers 4.28.0, fschat 0.2.2
Successfully fine-tuned llama-13b with the following arguments:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6 torchrun --nproc_per_node=7 --master_port=2001 fastchat/train/train_mem.py \
    --model_name_or_path huggyllama/llama-13b \
    --data_path /root/autodl-data/code/hjw/FastChat/fastchat/data/temp_1k.json \
    --bf16 True \
    --output_dir output \
    --num_train_epochs 1 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 16 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 10 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --lazy_preprocess True \
    --cache_dir /root/autodl-data/model
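For context, FastChat's train.py resumes automatically: if the output directory already contains a checkpoint-* folder, it calls trainer.train(resume_from_checkpoint=True) (the train.py:244 frame in the traceback below), which also makes the Trainer reload the saved optimizer state. A minimal sketch of that check, with output_dir standing in for training_args.output_dir:

```python
import pathlib

output_dir = "output"  # the --output_dir used above

# FastChat's train() resumes whenever a checkpoint-* directory already exists;
# the Trainer then reloads model weights plus optimizer/scheduler state from it.
if list(pathlib.Path(output_dir).glob("checkpoint-*")):
    print("resume from checkpoint.......")
    # trainer.train(resume_from_checkpoint=True)  # `trainer` exists inside FastChat's train()
else:
    print("no checkpoint found, training from scratch")
    # trainer.train()
```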
But continuing training from the previously saved checkpoint fails with the following error:
Traceback (most recent call last):
  File "/root/autodl-data/code/hjw/FastChat/fastchat/train/train_mem.py", line 11, in <module>
    train()
  File "/root/autodl-data/tools/anaconda3/envs/hjwchat/lib/python3.10/site-packages/fastchat/train/train.py", line 244, in train
    trainer.train(resume_from_checkpoint=True)
  File "/root/autodl-data/tools/anaconda3/envs/hjwchat/lib/python3.10/site-packages/transformers/trainer.py", line 1996, in _inner_training_loop
    self.optimizer.step()
  File "/root/autodl-data/tools/anaconda3/envs/hjwchat/lib/python3.10/site-packages/torch/optim/adamw.py", line 273, in _single_tensor_adamw
    exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
RuntimeError: The size of tensor a (46812160) must match the size of tensor b (81921280) at non-singleton dimension 0
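One way to narrow this down is to compare the optimizer state stored in the checkpoint against the sizes reported in the error. A minimal diagnostic sketch, assuming the checkpoint directory is output/checkpoint-10 and contains the optimizer.pt written by the HF Trainer (path and step number are illustrative):

```python
import torch

ckpt_dir = "output/checkpoint-10"  # illustrative; use the actual checkpoint directory

# The Trainer saves the optimizer state dict as optimizer.pt; print the size of
# each saved exp_avg tensor so it can be compared against the 46812160 /
# 81921280 values in the RuntimeError above.
opt_state = torch.load(f"{ckpt_dir}/optimizer.pt", map_location="cpu")
for param_key, state in opt_state["state"].items():
    exp_avg = state.get("exp_avg")
    if exp_avg is not None:
        print(param_key, exp_avg.numel())
```

Comparing these sizes with the two values in the error may show whether the saved (FSDP-sharded) optimizer state no longer lines up with the flattened parameters of the resumed run.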