Not all parameters were saved completely.

Open MagicBlueCH opened this issue 1 year ago • 4 comments

I found that when fine-tuning the 7B model, there were no error messages or interruptions, but not all parameters were saved completely.

Training ran on multiple V100 GPUs with this command:

torchrun --nproc_per_node=8 --master_port=20001 fastchat/train/train.py \
--model_name_or_path /root/vicuna-7b \
--data_path /root/data.json \
--bf16 False \
--output_dir /root/model \
--num_train_epochs 3 \
--per_device_train_batch_size 2 \
--per_device_eval_batch_size 2 \
--gradient_accumulation_steps 16 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 1200 \
--save_total_limit 10 \
--learning_rate 2e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--tf32 False \
--model_max_length 2048 \
--gradient_checkpointing True \
--lazy_preprocess True \
--deepspeed ./deepspeed.json

MagicBlueCH, Apr 22 '23 13:04

If you are using https://github.com/lm-sys/FastChat/blob/main/fastchat/train/train.py, then at line 246 replace safe_save_model_for_hf_trainer(trainer=trainer, output_dir=training_args.output_dir) with:

    checkpoint_dir = os.path.join(training_args.output_dir, "checkpoint-final")
    trainer.deepspeed.save_checkpoint(checkpoint_dir)

After training finishes, checkpoint-final will contain zero_to_fp32.py. From inside that directory, run python zero_to_fp32.py . pytorch_model.bin to produce the consolidated fp32 weights.

For more information, see: https://huggingface.co/transformers/v4.10.1/main_classes/deepspeed.html#getting-the-model-weights-out
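
The two steps above can be sketched as a pair of small helpers. This is a minimal sketch, not FastChat's own code: the names `save_final_checkpoint` and `build_convert_command` are hypothetical, it assumes `trainer.deepspeed` holds the DeepSpeed engine when training was launched with `--deepspeed`, and it assumes the `zero_to_fp32.py` copy in the checkpoint takes an output file name as its second argument, as in the command above.

```python
import os
import sys


def save_final_checkpoint(trainer, output_dir, name="checkpoint-final"):
    """Save a DeepSpeed checkpoint under output_dir/name and return its path.

    Hypothetical helper: `trainer` is assumed to be a transformers.Trainer
    launched with --deepspeed, so that `trainer.deepspeed` is the engine.
    """
    engine = getattr(trainer, "deepspeed", None)
    if engine is None:
        raise RuntimeError("trainer.deepspeed is not set; was --deepspeed passed?")
    checkpoint_dir = os.path.join(output_dir, name)
    # Every rank must make this call, because each rank writes its own shard.
    engine.save_checkpoint(checkpoint_dir)
    return checkpoint_dir


def build_convert_command(output_file="pytorch_model.bin"):
    """Return the argv for zero_to_fp32.py; run it from inside the checkpoint dir."""
    return [sys.executable, "zero_to_fp32.py", ".", output_file]
```

In train.py, `save_final_checkpoint(trainer, training_args.output_dir)` would stand in for the `safe_save_model_for_hf_trainer` call. The conversion can then be run once, on a single process, for example with `subprocess.run(build_convert_command(), cwd=checkpoint_dir, check=True)`.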

luffycodes, Apr 24 '23 03:04

I am not sure we can call trainer.deepspeed directly, since our train.py script does not use any DeepSpeed functionality.

zhisbug, Apr 25 '23 08:04

With the zero_to_fp32.py script, pytorch_model.bin can be generated, but there is no configuration file or tokenizer. Can we just reuse the original Vicuna tokenizer and configuration files?
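
One way to try this is to copy the auxiliary files from the base model directory next to the converted weights. This is a minimal sketch under two assumptions: fine-tuning did not add tokens or change the model architecture, and the base directory uses the usual Hugging Face file names for LLaMA-style models. The helper name `copy_aux_files` is hypothetical, and the paths in the usage note come from the command earlier in this thread.

```python
import os
import shutil

# File names commonly shipped with Hugging Face LLaMA-style checkpoints.
# Any file missing from the source directory is skipped.
AUX_FILES = (
    "config.json",
    "generation_config.json",
    "tokenizer.model",
    "tokenizer_config.json",
    "special_tokens_map.json",
)


def copy_aux_files(base_model_dir, target_dir, names=AUX_FILES):
    """Copy config and tokenizer files into target_dir; return the names copied."""
    os.makedirs(target_dir, exist_ok=True)
    copied = []
    for name in names:
        src = os.path.join(base_model_dir, name)
        if os.path.isfile(src):
            shutil.copy2(src, os.path.join(target_dir, name))
            copied.append(name)
    return copied
```

For the run above, that would be `copy_aux_files("/root/vicuna-7b", "/root/model/checkpoint-final")`. If the copied `config.json` does not match the fine-tuned weights, loading the model should fail or warn, so a quick load test afterwards is worthwhile.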

MagicBlueCH, Apr 26 '23 02:04

Regarding the question of whether trainer.deepspeed can be used directly, given that train.py does not use any DeepSpeed functionality: I tested it and it is supported, because train.py uses HfArgumentParser, which supports DeepSpeed parameters by default. I suggest that the project team add a DeepSpeed optimization example so that most people can train on V100 GPUs.
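
As an illustration of what such an example could look like, the sketch below writes a DeepSpeed config for use with `--deepspeed`. It is a hypothetical starting point, not the `deepspeed.json` used in this thread (which is not shown) and not a tuned recipe. The `"auto"` values rely on the Hugging Face Trainer integration filling them in from the command-line arguments. The sketch uses fp16 because V100 GPUs lack bf16 support, so it only enables mixed precision if `--fp16 True` is passed. The function name `make_v100_deepspeed_config` is made up for this example.

```python
import json


def make_v100_deepspeed_config(stage=3):
    """Build a DeepSpeed config dict for the Hugging Face Trainer (a sketch)."""
    zero = {
        "stage": stage,
        # Offload optimizer state to CPU memory to fit into 16/32 GB V100s.
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "overlap_comm": True,
        "contiguous_gradients": True,
    }
    if stage == 3:
        # Under ZeRO-3 the weights are sharded across ranks; this option asks
        # DeepSpeed to gather full 16-bit weights when the model is saved.
        zero["stage3_gather_16bit_weights_on_model_save"] = True
    return {
        "fp16": {"enabled": "auto"},
        "zero_optimization": zero,
        "gradient_accumulation_steps": "auto",
        "gradient_clipping": "auto",
        "train_batch_size": "auto",
        "train_micro_batch_size_per_gpu": "auto",
    }


def write_config(path, config):
    """Write the config as JSON so it can be passed via --deepspeed."""
    with open(path, "w") as f:
        json.dump(config, f, indent=2)
```

Calling `write_config("deepspeed.json", make_v100_deepspeed_config())` produces a file that can be passed as `--deepspeed ./deepspeed.json`. Whether stage 2 or stage 3 fits better depends on GPU memory and model size, so both are worth trying.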

MagicBlueCH, Apr 26 '23 02:04