FastChat
Not all parameters were saved completely.
I found that when fine-tuning the 7B model, there were no error messages or interruptions, but not all parameters were saved completely.
Multiple V100 GPUs were used, with the following command:
torchrun --nproc_per_node=8 --master_port=20001 fastchat/train/train.py \
    --model_name_or_path /root/vicuna-7b \
    --data_path /root/data.json \
    --bf16 False \
    --output_dir /root/model \
    --num_train_epochs 3 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 16 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 1200 \
    --save_total_limit 10 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 False \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --lazy_preprocess True \
    --deepspeed ./deepspeed.json
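The contents of ./deepspeed.json are not shown in this thread. For readers who want a starting point, here is a minimal sketch of what such a file might look like for ZeRO stage 3 on V100 (which has no bf16 support). Every value below is an assumption rather than the poster's actual config, and it has not been verified on this setup. The "auto" entries rely on the Hugging Face Trainer integration to fill in values from the command-line arguments. The helper name write_ds_config is hypothetical.

```python
import json

# Hypothetical DeepSpeed config: the real ./deepspeed.json is not shown in the thread.
ds_config = {
    # "auto" lets the HF Trainer decide based on the --fp16 flag (V100 cannot use bf16).
    "fp16": {"enabled": "auto"},
    "zero_optimization": {
        "stage": 3,  # shard parameters, gradients and optimizer state across GPUs
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "overlap_comm": True,
        "contiguous_gradients": True,
        # Ask DeepSpeed to gather the sharded 16-bit weights when the model is saved.
        "stage3_gather_16bit_weights_on_model_save": True,
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
}


def write_ds_config(path="deepspeed.json"):
    """Write the sketch config to `path` and return the path."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(ds_config, f, indent=2)
    return path


if __name__ == "__main__":
    print(write_ds_config())
```

Under ZeRO stage 3 each rank holds only a shard of the parameters, which is one possible (unconfirmed) reason a plain save could come out incomplete.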
If you are using https://github.com/lm-sys/FastChat/blob/main/fastchat/train/train.py, then at line 246 replace
safe_save_model_for_hf_trainer(trainer=trainer, output_dir=training_args.output_dir)
with
checkpoint_dir = os.path.join(training_args.output_dir, "checkpoint-final")
trainer.deepspeed.save_checkpoint(checkpoint_dir)
(add import os at the top of train.py if it is not already there). After training is done, checkpoint-final will contain zero_to_fp32.py. From inside that directory, just run
python zero_to_fp32.py . pytorch_model.bin
For more information, see https://huggingface.co/transformers/v4.10.1/main_classes/deepspeed.html#getting-the-model-weights-out
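The replacement described above can also be wrapped so that the script still works when DeepSpeed is not enabled. This is only a sketch: save_final_checkpoint and its fallback_save parameter are hypothetical names, and it assumes trainer.deepspeed is either None or an engine exposing save_checkpoint, as in the snippet above.

```python
import os


def save_final_checkpoint(trainer, output_dir, fallback_save=None):
    """Save a DeepSpeed checkpoint if an engine is attached, else fall back.

    Returns the directory that was written to.
    """
    # trainer.deepspeed is assumed to be None when DeepSpeed is not in use.
    engine = getattr(trainer, "deepspeed", None)
    if engine is not None:
        checkpoint_dir = os.path.join(output_dir, "checkpoint-final")
        # Writes the sharded ZeRO checkpoint into checkpoint_dir.
        engine.save_checkpoint(checkpoint_dir)
        return checkpoint_dir
    if fallback_save is not None:
        # e.g. the original safe_save_model_for_hf_trainer from train.py
        fallback_save(trainer=trainer, output_dir=output_dir)
    return output_dir
```

In train.py the call would then look like save_final_checkpoint(trainer, training_args.output_dir, fallback_save=safe_save_model_for_hf_trainer).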
I am not sure we can directly use the method trainer.deepspeed, since our train.py script does not use any deepspeed functionality.
Excuse me, with the change suggested above, pytorch_model.bin is generated, but there is no configuration file or tokenizer next to it. Can we just reuse the original Vicuna configuration and tokenizer files?
"I am not sure we can directly use the method trainer.deepspeed, since our train.py script does not use any deepspeed functionality."
I tested it, and it works. train.py parses its arguments with HfArgumentParser, and the Hugging Face TrainingArguments it parses accept a --deepspeed option by default. I suggest that the project team add a DeepSpeed optimization example so that most people can train on V100 GPUs.
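On the earlier question about the missing configuration and tokenizer files: the thread does not confirm an answer. If training did not change the vocabulary or the model architecture, one option might be to copy those files from the base model directory next to the converted pytorch_model.bin. The sketch below does that. The file list and the name copy_aux_files are assumptions, and files that do not exist in the source directory are simply skipped.

```python
import os
import shutil

# Assumed set of auxiliary files; adjust to whatever the base model directory contains.
AUX_FILES = (
    "config.json",
    "generation_config.json",
    "tokenizer.model",
    "tokenizer_config.json",
    "special_tokens_map.json",
)


def copy_aux_files(base_model_dir, target_dir, names=AUX_FILES):
    """Copy config/tokenizer files from base_model_dir into target_dir.

    Returns the list of file names that were actually copied.
    """
    os.makedirs(target_dir, exist_ok=True)
    copied = []
    for name in names:
        src = os.path.join(base_model_dir, name)
        if os.path.isfile(src):  # skip files the base model does not ship
            shutil.copy2(src, os.path.join(target_dir, name))
            copied.append(name)
    return copied
```

With the paths from this thread, that would be copy_aux_files("/root/vicuna-7b", "/root/model/checkpoint-final"), after which it is worth loading the result once to check that the tokenizer and weights agree.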