OpenRLHF
Adding a length penalty to the reward
Hi Team, while using the PPO pipeline we occasionally observe spikes in response length. Are any length-penalty techniques available, or have any been explored?
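
To make the question concrete, one simple form of length penalty subtracts a per-token cost from the scalar reward for every token beyond a target length. The sketch below is plain Python and only illustrates the idea; `apply_length_penalty`, `target_length`, and `penalty_per_token` are hypothetical names and defaults, not part of OpenRLHF, and the right values would need tuning.

```python
from typing import List


def apply_length_penalty(
    rewards: List[float],
    response_lengths: List[int],
    target_length: int = 256,          # assumed threshold, not an OpenRLHF default
    penalty_per_token: float = 0.001,  # assumed cost per excess token
) -> List[float]:
    """Subtract a linear penalty for tokens beyond target_length.

    Responses at or below target_length keep their original reward.
    """
    if len(rewards) != len(response_lengths):
        raise ValueError("rewards and response_lengths must have the same length")

    penalized = []
    for reward, length in zip(rewards, response_lengths):
        excess = max(0, length - target_length)  # tokens over the target
        penalized.append(reward - penalty_per_token * excess)
    return penalized


# Usage: the second response is 100 tokens over the target and is penalized.
print(apply_length_penalty([1.0, 1.0], [100, 356], penalty_per_token=0.01))
```

Penalizing only the excess over a threshold, rather than the full length, leaves short and medium responses untouched, so the penalty should only act on the spikes.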