stanford_alpaca
High training loss of LLaMA 13B
I tried to train LLaMA 13B on 8 x A100 40G GPUs with the exact same configuration as 7B (except using DeepSpeed ZeRO stage 3) and found that the 13B model had an unusually high training loss. Does anyone know why?
- deepspeed config
{
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    },
    "train_micro_batch_size_per_gpu": "auto"
}
- training loss (loss curve screenshot not included here)
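For context, this JSON is consumed by the Hugging Face Trainer inside train.py, passed in via the --deepspeed flag (the full launch command is shown further down in this thread). A minimal sketch of the equivalent Python wiring, with an illustrative output path:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./out-13b",                   # hypothetical output directory
    deepspeed="./deepspeed-cfg/ZeRO-3.json",  # the ZeRO-3 config shown above
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=2e-5,
)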
It is a good question!
How about adding gradient clipping for the 13B model? Also try gradient accumulation; I think this can handle the abnormal loss problem.
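A toy sketch of what this suggestion amounts to in a plain PyTorch loop (the model, data, and sizes are stand-ins; in the actual run both behaviours come from the Trainer flags --gradient_accumulation_steps and --max_grad_norm):

import torch

accum_steps = 8      # accumulate gradients over 8 micro-batches
max_norm = 1.0       # clip the global gradient norm to 1.0

model = torch.nn.Linear(16, 1)                        # stand-in for the 13B model
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

for step in range(64):
    x, y = torch.randn(4, 16), torch.randn(4, 1)      # stand-in micro-batch
    loss = torch.nn.functional.mse_loss(model(x), y)
    (loss / accum_steps).backward()                   # average over the accumulation window
    if (step + 1) % accum_steps == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
        optimizer.step()
        optimizer.zero_grad()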
gradient clipping
@ZeyuTeng96 Hi. Gradient accumulation was used, and max_grad_norm defaults to 1.
The following is the full configuration:
torchrun \
--nnodes=$HOST_NUM \
--nproc_per_node=$HOST_GPU_NUM \
--rdzv_id=$TJ_INSTANCE_ID \
--rdzv_backend=c10d \
--rdzv_endpoint=$CHIEF_IP \
--master_port=12345 \
train.py \
--model_name_or_path $MODEL_PATH \
--tokenizer_name_or_path $TOKENIZER_PATH \
--data_path ./alpaca_data.json \
--bf16 True \
--output_dir $OUTPUT_DIR \
--num_train_epochs 3 \
--per_device_train_batch_size 2 \
--per_device_eval_batch_size 2 \
--gradient_accumulation_steps 8 \
--evaluation_strategy "no" \
--save_strategy "epoch" \
--learning_rate 2e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
--tf32 True \
--deepspeed ./deepspeed-cfg/ZeRO-3.json
Hello my friend,
did you manage to train a model with these settings? I used settings quite similar to yours with a smaller Alpaca dataset. My trained model cannot generate relevant responses and produces very messy output when prompted with instructions from the training data.
It seems that you should add "gradient_accumulation_steps": "auto" to the DeepSpeed config; otherwise, the gradient_accumulation_steps inside DeepSpeed stays at 1.
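A small sketch of that fix (assuming the ./deepspeed-cfg/ZeRO-3.json path from the command above), patching the config so DeepSpeed inherits accumulation and clipping from the Trainer instead of silently using its own defaults:

import json

path = "./deepspeed-cfg/ZeRO-3.json"   # config file passed to --deepspeed above
with open(path) as f:
    cfg = json.load(f)

# "auto" lets the HF Trainer fill these in from --gradient_accumulation_steps
# and --max_grad_norm at startup, instead of DeepSpeed defaulting them.
cfg["gradient_accumulation_steps"] = "auto"
cfg["gradient_clipping"] = "auto"

with open(path, "w") as f:
    json.dump(cfg, f, indent=4)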
Any updates here? I got an error saying "Using --fsdp xxx together with --deepspeed is not possible, deactivate one of those flags." Do we still need the --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' flag?
@ZeyuTeng96 @jyshee @zixiliuUSC Hi everyone, sorry for the late reply. Following @jyshee's suggestion, I have successfully run training of the 13B model. Below are my full configuration and training loss, but I haven't reached the first checkpoint yet.
# train.sh
torchrun \
--nnodes=$HOST_NUM \
--nproc_per_node=$HOST_GPU_NUM \
--rdzv_id=$TJ_INSTANCE_ID \
--rdzv_backend=c10d \
--rdzv_endpoint=$CHIEF_IP \
--master_port=12345 \
train.py \
--model_name_or_path $MODEL_PATH \
--train_data_path $DATA \
--bf16 True \
--output_dir $OUTPUT_DIR \
--num_train_epochs 3 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 16 \
--evaluation_strategy "steps" \
--eval_steps 2000 \
--save_strategy "steps" \
--save_steps 2000 \
--learning_rate 2e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
--tf32 True \
--deepspeed ./deepspeed-cfg/ZeRO-3.json
# ZeRO-3.json
{
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "train_micro_batch_size_per_gpu": "auto"
}
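As a quick sanity check on the new flags (GPU count of 8 taken from the original post), the effective batch size still matches the earlier 7B-style command in this thread (2 x 8 x 8 = 128 examples per optimizer step):

# effective batch size under the train.sh settings above
per_device_train_batch_size = 1
gradient_accumulation_steps = 16
num_gpus = 8   # assumed from the 8 x A100 setup in the original post

effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
print(effective_batch_size)  # 128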

@zwhe99 Hi, I am reaching out to ask whether you have seen any sub-optimal behavior from the DeepSpeed-fine-tuned model compared to the non-DeepSpeed fine-tuned model, especially that it stops generating after repeating the prompt.
For example:
Non-DeepSpeed-fine-tuned model
Explain how algorithms can be used in educational institutions. Algorithms can be used in educational institutions to automate certain processes, such as grading tests and homework, providing personalized learning recommendations, and helping students find resources related to their coursework. Algorithms can also be used to track student progress, identify areas of difficulty, and provide feedback to students and teachers.
DeepSpeed-fine-tuned model
Explain how algorithms can be used in educational institutions.
Thanks!
Hey @zwhe99, I got the model to train, but the weights aren't fully saved during checkpointing, even though I'm using the same ZeRO-3.json config and training settings. According to the HF DeepSpeed docs, the model state is supposed to be saved in global_step*/*optim_states.pt files, but these are missing. I'm using deepspeed==0.8.3, transformers==4.27.0.dev0, accelerate==0.18.0, and torch==2.0.0.
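A small diagnostic sketch (paths are hypothetical, not from the repo): check whether a checkpoint directory actually contains the sharded states, and if it does, consolidate the ZeRO-partitioned weights with DeepSpeed's documented zero_to_fp32 helper:

import glob
import os
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

checkpoint_dir = "./output/checkpoint-2000"   # hypothetical checkpoint path
pattern = os.path.join(checkpoint_dir, "global_step*", "*optim_states.pt")
shards = glob.glob(pattern)
print(f"found {len(shards)} optimizer shard file(s)")

if shards:
    # Gathers the partitioned weights into a single fp32 state dict (CPU-RAM heavy).
    state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir)
    print(f"consolidated {len(state_dict)} tensors")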