
Dialogue length decreases when training Qwen2.5-1.5B with 16-bit LoRA GRPO RL

Open · AdAstraAbyssoque opened this issue 2 weeks ago • 1 comment


Description:

When training Qwen2.5-1.5B with 16-bit LoRA GRPO RL, I encountered a problem where the dialogue length decreased over training. This happened regardless of whether I used a model that had been pre-trained with CoT SFT or the original Qwen2.5-1.5B. It's strange because I barely changed the reward, making only minor improvements such as adding a \boxed check to make it more precise. I was expecting an "aha moment" where the dialogue length would increase, but the opposite happened. I also observed this issue with the 7B base model.
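For reference, the \boxed check I mean is along these lines (a minimal sketch, not my exact code; the function name, the TRL-style signature with an `answer` column passed via keyword arguments, and the matching rule are assumptions):

```python
import re

def boxed_correctness_reward(completions, answer, **kwargs):
    """Reward 1.0 when the last \\boxed{...} in a completion matches the
    reference answer, else 0.0. Simple regex; nested braces not handled."""
    rewards = []
    for completion, ref in zip(completions, answer):
        matches = re.findall(r"\\boxed\{([^{}]*)\}", completion)
        extracted = matches[-1].strip() if matches else None
        rewards.append(1.0 if extracted == str(ref).strip() else 0.0)
    return rewards
```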

Questions:

  • Is this a problem with LoRA?
  • Is it a problem with the base model?
  • Is it a problem with the reward function?

Environment:

  • Model: Qwen2.5-1.5B and Qwen2.5-7B base models
  • Training method: 16-bit LoRA GRPO RL
  • Modifications made: added a \boxed check to the reward function for more precision
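My training setup is roughly the following (a minimal sketch of 16-bit LoRA GRPO with Unsloth and TRL; the hyperparameters are illustrative, and `dataset` is assumed to provide prompts plus an `answer` column for the reward function above):

```python
from unsloth import FastLanguageModel
from trl import GRPOConfig, GRPOTrainer

# Load the base model in 16-bit (load_in_4bit=False) and attach LoRA adapters.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen2.5-1.5B",
    max_seq_length=2048,
    load_in_4bit=False,  # 16-bit LoRA, as in this issue
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

trainer = GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    reward_funcs=[boxed_correctness_reward],  # the \boxed check above
    args=GRPOConfig(
        num_generations=8,
        max_completion_length=1024,
    ),
    train_dataset=dataset,  # assumed: prompts + reference answers
)
trainer.train()
```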

AdAstraAbyssoque · Feb 10 '25 15:02