Dialogue length decreases when training Qwen2.5-1.5B with 16-bit LoRA GRPO RL
Description:
When training Qwen2.5-1.5B with 16-bit LoRA GRPO RL, I encountered a problem where the dialogue length decreased over training. This happened regardless of whether I started from a model that had been pre-trained with CoT SFT or from the original Qwen2.5-1.5B. It's strange because I made almost no changes to the reward, only minor improvements such as adding a \boxed check to make the correctness judgment more precise. I was expecting an "aha moment" where the dialogue length would increase, but the opposite happened. I also observed the same issue with the 7B base model.
Questions:
- Is this a problem with LoRA?
- Is it a problem with the base model?
- Is it a problem with the reward function?
Environment:
- Model: Qwen2.5-1.5B and Qwen2.5-7B base models
- Training method: 16-bit LoRA GRPO RL
- Modifications made: added a \boxed check to the reward function for more precision
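
For reference, the \boxed check is roughly along these lines. This is a simplified sketch, not my exact code: it assumes the TRL/Unsloth GRPO notebook conventions (chat-format completions as lists of message dicts, a reference `answer` column passed through from the dataset), and the regex only handles un-nested braces.

```python
import re

def extract_boxed(text: str):
    """Return the contents of the last \\boxed{...} in the text, if any.

    Note: this simple regex does not handle nested braces.
    """
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None

def boxed_correctness_reward(prompts, completions, answer, **kwargs):
    """Reward 2.0 only when the \\boxed answer exactly matches the reference."""
    # Chat-format completions: each item is a list of message dicts.
    responses = [completion[0]["content"] for completion in completions]
    rewards = []
    for response, reference in zip(responses, answer):
        predicted = extract_boxed(response)
        correct = predicted is not None and predicted == str(reference).strip()
        rewards.append(2.0 if correct else 0.0)
    return rewards
```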