Loss stays at 0 when fine-tuning the Qwen-1.8B model on a single A100
Reminder
- [X] I have read the README and searched the existing issues.
Reproduction
Observation 1
Running full SFT of the Qwen-1.8B model on a single A100 80G, the training loss stays at 0 for the entire run. Why? The command is:
```bash
CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
    --stage sft \
    --do_train \
    --model_name_or_path /home/qwen1_8 \
    --dataset sft_sales_multilabel \
    --dataset_dir ./data/sft_sales_multilabel \
    --template chatml \
    --finetuning_type full \
    --output_dir saves/qwen1.8/sft \
    --overwrite_cache \
    --overwrite_output_dir \
    --cutoff_len 1024 \
    --low_cpu_mem_usage false \
    --preprocessing_num_workers 16 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 2 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --warmup_steps 20 \
    --save_strategy 'epoch' \
    --eval_steps 1000 \
    --evaluation_strategy steps \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --val_size 0.1 \
    --plot_loss \
    --bf16
```
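As a quick sanity check, the minimal snippet below (my own standalone script, not part of LLaMA-Factory) loads the same checkpoint in bf16 and runs a single forward pass with labels. A finite loss greater than 0 here would suggest the checkpoint and dtype are fine and the problem is in the training setup; 0.0 or NaN would point at the model/dtype instead.

```python
# Minimal sanity check (separate from LLaMA-Factory, assumptions: the checkpoint
# path below and that the Qwen remote code loads under this transformers version).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/home/qwen1_8"  # same path as in the command above

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.bfloat16, trust_remote_code=True
).cuda().eval()

inputs = tokenizer("Hello, please introduce yourself.", return_tensors="pt").to("cuda")
labels = inputs["input_ids"].clone()  # next-token loss over the whole prompt

with torch.no_grad():
    outputs = model(**inputs, labels=labels)

print(outputs.loss)  # expected: a finite value > 0
```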
Observation 2
On the same single A100, SFT of the Qwen-1.8B model does train when using DeepSpeed ZeRO-3, but GPU utilization stays very low throughout. The DeepSpeed config is as follows:
```json
{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": false,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  }
}
```
The training script is as follows:
```bash
#!/bin/bash
deepspeed --num_gpus 1 src/train_bash.py \
    --deepspeed ds_z3_config.json \
    --stage sft \
    --do_train \
    --model_name_or_path /home/qwen1_8 \
    --dataset sft_sales_multilabel \
    --dataset_dir ./data/sft_sales_multilabel \
    --template chatml \
    --finetuning_type full \
    --output_dir saves/qwen1.8/sft \
    --overwrite_cache \
    --overwrite_output_dir \
    --cutoff_len 1024 \
    --preprocessing_num_workers 16 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 2 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --warmup_steps 20 \
    --save_strategy 'epoch' \
    --eval_steps 1000 \
    --evaluation_strategy steps \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --val_size 0.1 \
    --plot_loss \
    --fp16
```
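To put numbers on "GPU utilization stays very low", I keep a small helper like the one below running in a second terminal during training (my own addition, not part of LLaMA-Factory; it only assumes nvidia-smi is on PATH):

```python
# Standalone helper to sample GPU utilization and memory while the DeepSpeed
# run is active. Assumes `nvidia-smi` is available on PATH.
import subprocess
import time

def sample_gpu_util(duration_s: int = 120, interval_s: int = 2) -> None:
    """Print GPU utilization (%) and memory used (MiB) every `interval_s` seconds."""
    for _ in range(duration_s // interval_s):
        stats = subprocess.check_output(
            [
                "nvidia-smi",
                "--query-gpu=utilization.gpu,memory.used",
                "--format=csv,noheader,nounits",
            ],
            text=True,
        ).strip()
        print(time.strftime("%H:%M:%S"), stats)
        time.sleep(interval_s)

if __name__ == "__main__":
    sample_gpu_util()
```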
Expected behavior
No response
System Info
python==3.9.18
tokenizers==0.15.2
torch==2.0.1
transformers==4.39.0
deepspeed==0.14.0
Others
No response