
High training loss of LLaMA 13B

Open · zwhe99 opened this issue 1 year ago · 6 comments

I tried to train LLaMA 13B with the exact same configuration as 7B (except using deepspeed ZeRO stage 3) and found that the 13B model had an unusually high training loss (8 x A100 40G GPUs). Does anyone know why?

  • deepspeed config
{
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    },
    "train_micro_batch_size_per_gpu": "auto"
}
  • training loss image

zwhe99 avatar Mar 19 '23 07:03 zwhe99

It is a good question!

hujunchao avatar Mar 19 '23 13:03 hujunchao

How about adding gradient clipping for the 13B model? Also try gradient accumulation. I think that could handle the abnormal loss problem.
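
For concreteness, a minimal sketch of how those two knobs map onto train.py's flags, assuming it exposes the standard HF TrainingArguments (values are illustrative, not tuned for 13B):

# Sketch only: --max_grad_norm enables gradient clipping (the HF default is 1.0),
# --gradient_accumulation_steps accumulates gradients before each optimizer step.
torchrun --nproc_per_node=8 train.py \
    --model_name_or_path $MODEL_PATH \
    --data_path ./alpaca_data.json \
    --output_dir $OUTPUT_DIR \
    --bf16 True \
    --max_grad_norm 1.0 \
    --gradient_accumulation_steps 8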

ZeyuTeng96 avatar Mar 20 '23 01:03 ZeyuTeng96

gradient clipping

@ZeyuTeng96 Hi. Gradient accumulation was used, and max_grad_norm defaults to 1.

The following is the full configuration:

torchrun \
    --nnodes=$HOST_NUM \
    --nproc_per_node=$HOST_GPU_NUM \
    --rdzv_id=$TJ_INSTANCE_ID \
    --rdzv_backend=c10d \
    --rdzv_endpoint=$CHIEF_IP \
    --master_port=12345 \
train.py \
    --model_name_or_path $MODEL_PATH \
    --tokenizer_name_or_path $TOKENIZER_PATH \
    --data_path ./alpaca_data.json \
    --bf16 True \
    --output_dir $OUTPUT_DIR \
    --num_train_epochs 3 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "epoch" \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --tf32 True \
    --deepspeed ./deepspeed-cfg/ZeRO-3.json

zwhe99 avatar Mar 20 '23 13:03 zwhe99

(quoting @zwhe99's reply and full configuration above)

Hello my friend,

By using these settings, did you manage to train a model? I used quite similar settings to yours with a smaller Alpaca dataset, and my trained model cannot generate relevant responses; the output is very messy even when given instructions from the training data.

ZeyuTeng96 avatar Mar 21 '23 13:03 ZeyuTeng96

It seems that you should add "gradient_accumulation_steps": "auto" to the DeepSpeed config; otherwise, the gradient_accumulation_steps inside DeepSpeed stays at 1.
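
For concreteness, a minimal sketch of where that key goes in ZeRO-3.json (the ZeRO-3 options shown above stay unchanged; "gradient_clipping": "auto" is taken from the corrected config @zwhe99 posts below):

{
    "zero_optimization": {
        "stage": 3
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "train_micro_batch_size_per_gpu": "auto"
}

With "auto", the HF Trainer fills in its own command-line values (e.g. --gradient_accumulation_steps 8), so the two configs cannot drift apart.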

jyshee avatar Mar 22 '23 06:03 jyshee

Any updates here? I got an error saying "Using --fsdp xxx together with --deepspeed is not possible, deactivate one of those flags." Do we still need the --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' flag?

zixiliuUSC avatar Mar 24 '23 03:03 zixiliuUSC

@ZeyuTeng96 @jyshee @zixiliuUSC Hi everyone, sorry for the late reply. Following @jyshee's suggestion, I have successfully started training the 13B model. Below are my full configuration and the training loss, but I haven't reached the first checkpoint yet.

# train.sh
torchrun \
    --nnodes=$HOST_NUM \
    --nproc_per_node=$HOST_GPU_NUM \
    --rdzv_id=$TJ_INSTANCE_ID \
    --rdzv_backend=c10d \
    --rdzv_endpoint=$CHIEF_IP \
    --master_port=12345 \
train.py \
    --model_name_or_path $MODEL_PATH \
    --train_data_path $DATA \
    --bf16 True \
    --output_dir $OUTPUT_DIR \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 16 \
    --evaluation_strategy "steps" \
    --eval_steps 2000 \
    --save_strategy "steps" \
    --save_steps 2000 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --tf32 True \
    --deepspeed ./deepspeed-cfg/ZeRO-3.json
# ZeRO-3.json
{
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "train_micro_batch_size_per_gpu": "auto"
}
(training loss plot)

zwhe99 avatar Mar 25 '23 10:03 zwhe99

@zwhe99 Hi, I am reaching out to ask whether you have seen any sub-optimal behavior from the DeepSpeed-fine-tuned model compared to a non-DeepSpeed fine-tuned model, especially the behavior where it stops generating after repeating the prompt.

For example -

Non-DeepSpeed-fine-tuned model

Explain how algorithms can be used in educational institutions. Algorithms can be used in educational institutions to automate certain processes, such as grading tests and homework, providing personalized learning recommendations, and helping students find resources related to their coursework. Algorithms can also be used to track student progress, identify areas of difficulty, and provide feedback to students and teachers.

DeepSpeed-fine-tuned model

Explain how algorithms can be used in educational institutions.

Thanks!

XinliYu avatar Mar 29 '23 16:03 XinliYu

Hey @zwhe99 I got the model to train, but the weights aren't fully saved during checkpointing, even though I'm using the same ZeRO-3.json config and training settings. According to the HF DeepSpeed docs, the model state is supposed to be saved in global_step*/*optim_states.pt, but these files are missing. I'm using deepspeed==0.8.3, transformers==4.27.0.dev0, accelerate==0.18.0, and torch==2.0.0.
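
For reference only, not a confirmed fix: when the global_step*/ shards are actually written, DeepSpeed drops a zero_to_fp32.py script into each saved checkpoint directory that consolidates the ZeRO-3 shards into a single fp32 state dict. A sketch (the checkpoint path below is hypothetical):

# Hypothetical checkpoint path; run from inside the saved checkpoint directory.
cd $OUTPUT_DIR/checkpoint-2000
python zero_to_fp32.py . pytorch_model.bin

Separately, with "stage3_gather_16bit_weights_on_model_save": true (as in the ZeRO-3.json above), the Trainer should also write a consolidated 16-bit pytorch_model.bin at each save.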

sabetAI avatar Apr 01 '23 01:04 sabetAI