LLaMA-Factory
Reward suddenly turns negative and the loss spikes during PPO training
Reminder
- [X] I have read the README and searched the existing issues.
Reproduction
cd LLaMA-Factory && HF_ENDPOINT=https://hf-mirror.com accelerate launch src/train_bash.py \
--stage sft \
--do_train \
--model_name_or_path codellama/CodeLlama-7b-Python-hf \
--dataset codealpaca,codeforces_python_submissions_sft \
--template default \
--finetuning_type lora \
--lora_target q_proj,v_proj \
--output_dir output/sft/test_train \
--overwrite_cache \
--per_device_train_batch_size 4 \
--gradient_accumulation_steps 4 \
--lr_scheduler_type cosine \
--logging_steps 10 \
--save_steps 500 \
--learning_rate 5e-5 \
--num_train_epochs 3.0 \
--plot_loss \
--fp16
cd LLaMA-Factory && HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1 accelerate launch src/train_bash.py \
--stage rm \
--do_train \
--model_name_or_path codellama/CodeLlama-7b-Python-hf \
--adapter_name_or_path output/sft/test_train \
--create_new_adapter \
--dataset codeforces_python_submissions_rl \
--template default \
--finetuning_type lora \
--lora_target q_proj,v_proj \
--output_dir output/rm/test_train \
--per_device_train_batch_size 1 \
--gradient_accumulation_steps 16 \
--lr_scheduler_type cosine \
--logging_steps 10 \
--save_steps 500 \
--learning_rate 1e-4 \
--num_train_epochs 1.0 \
--plot_loss \
--fp16
cd LLaMA-Factory && HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1 accelerate launch src/train_bash.py \
--stage ppo \
--do_train \
--model_name_or_path codellama/CodeLlama-7b-Python-hf \
--adapter_name_or_path output/sft/test_train \
--create_new_adapter \
--dataset codealpaca,codeforces_python_submissions_sft \
--template default \
--finetuning_type lora \
--lora_target q_proj,v_proj \
--reward_model output/rm/test_train \
--output_dir output/ppo/test_train \
--per_device_train_batch_size 1 \
--gradient_accumulation_steps 16 \
--lr_scheduler_type cosine \
--top_k 0 \
--top_p 0.8 \
--logging_steps 10 \
--save_steps 100 \
--learning_rate 1e-5 \
--num_train_epochs 1.0 \
--plot_loss \
--fp16
Expected behavior
During the PPO stage, the loss suddenly spikes and the reward turns negative. What could be causing this?
System Info
No response
Others
No response
I've run into the same problem. Have you solved it?
Not yet. I'm still checking the SFT and RM stages as well.
You could do some error analysis, e.g. compare the rollouts from where the reward starts dropping with the ones from where it turns negative. A few things to check:
- The policy may not actually be learning, so its outputs stay very close to the SFT model's; when the two are nearly identical, the objective also pushes the reward down. Try printing the ratio π(x, y) / π_sft(x, y) (see the sketch after this list).
- Look at the advantage values. Normally the advantage is clipped, and the default config should already do this, but it's worth verifying.
- Your sampling may not be producing good enough candidates. You're using top-p/top-k; you could try best-of-n instead: generate several responses and let the model learn from the best one.
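A minimal sketch of the ratio check, outside of LLaMA-Factory: it reloads the adapters from the paths used in the commands above (output/sft/test_train and output/ppo/test_train) and compares the log-probability each model assigns to the same response. The prompt/response pair is made up, and the adapter loading is simplified; adjust it to how your adapters are actually stacked.

```python
# Sketch: compare the log-probability the PPO policy and the frozen SFT
# reference assign to the same response. Paths reuse the ones from the
# reproduction commands above; prompt/response are hypothetical.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE = "codellama/CodeLlama-7b-Python-hf"
tok = AutoTokenizer.from_pretrained(BASE)

def response_logprob(model, prompt, response):
    """Sum of log p(token | context) over the response tokens only."""
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(prompt + response, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        logits = model(full_ids).logits
    logprobs = torch.log_softmax(logits[:, :-1].float(), dim=-1)
    targets = full_ids[:, 1:]
    per_token = logprobs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return per_token[:, prompt_len - 1:].sum().item()  # response tokens only

# Reference = base + SFT adapter; policy = base + PPO adapter.
# Adapter stacking is simplified here -- adjust to your checkpoint layout,
# and load the two models one at a time if memory is tight.
ref = PeftModel.from_pretrained(
    AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.float16, device_map="auto"),
    "output/sft/test_train")
policy = PeftModel.from_pretrained(
    AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.float16, device_map="auto"),
    "output/ppo/test_train")

prompt = "Write a Python function that reverses a string.\n"  # hypothetical prompt
response = "def reverse(s):\n    return s[::-1]\n"            # hypothetical rollout

log_ratio = response_logprob(policy, prompt, response) - response_logprob(ref, prompt, response)
print(f"log pi_policy(y|x) / pi_sft(y|x) = {log_ratio:.4f}")
```

If the log-ratio stays near zero throughout training, the policy is barely moving away from the SFT model; if it blows up, the KL penalty grows and drags the reported reward down.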