
Generalized PPO loss (& improve current GRPO loss)

Open qingquansong opened this issue 8 months ago • 4 comments

🚀 The feature, motivation and pitch

  • [ ] 1. The current GRPO loss assumes the KL term is always added and computes the advantage inside the loss; we want to make both configurable by the user.
  • [ ] 2. Clipping is not used in the GRPO loss and needs to be added (also as an option for the generalized PPO case).
  • [ ] 3. As an option, reference model logits/probs can be provided directly instead of the LM head + hidden states.
  • [ ] 4. The old policy probs (or LM head/hidden states) are not currently accepted by the GRPO loss, so importance sampling is not possible.
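
For illustration, the four items above can be read as options on a single per-token objective. The following is a minimal plain-Python sketch under stated assumptions, not Liger-Kernel's or TRL's actual API: the function name `ppo_token_loss` and every parameter name are hypothetical, log-probs are passed in directly (item 3), and the KL term uses the `exp(d) - d - 1` estimator commonly used for GRPO.

```python
import math


def ppo_token_loss(logp, advantage, old_logp=None, ref_logp=None,
                   beta=0.0, eps_low=0.2, eps_high=0.2):
    """Per-token clipped surrogate loss with an optional KL penalty.

    All names here are illustrative assumptions, not a library API.
    logp / old_logp / ref_logp are log-probs of the sampled token under
    the current, old, and reference policies.
    """
    # Item 4: without old-policy log-probs, fall back to ratio == 1 (on-policy).
    if old_logp is None:
        old_logp = logp
    ratio = math.exp(logp - old_logp)

    # Item 2: PPO-style clipping of the importance ratio.
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    surrogate = min(ratio * advantage, clipped * advantage)
    loss = -surrogate

    # Item 1: the KL term is optional and skipped entirely when beta == 0.
    if beta != 0.0 and ref_logp is not None:
        d = ref_logp - logp
        loss += beta * (math.exp(d) - d - 1.0)
    return loss


# Ratio of 2.0 with a positive advantage is clipped to 1 + eps_high.
print(ppo_token_loss(math.log(0.5), 1.0, old_logp=math.log(0.25)))
```

A real kernel would apply the same arithmetic over tensors and reduce over tokens with a mask; the scalar form is only meant to make the role of each option explicit.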

Alternatives

No response

Additional context

No response

qingquansong avatar Mar 24 '25 21:03 qingquansong

This is assigned to me for now; if anyone is interested in any of the features, feel free to take some.

qingquansong avatar Mar 24 '25 21:03 qingquansong

Hi @qingquansong, I'd like to pick up item 1 first; if I finish it early, I can take more.

mRSun15 avatar Mar 24 '25 22:03 mRSun15

Yes, the implementation in Liger is older, so it would need updating to match the newer version in TRL.

kashif avatar Mar 25 '25 13:03 kashif

Worked on #628 ... didn't see this issue before, so I was working on this independently. In my implementation:

  1. KL is only calculated if beta is non-zero. It assumes the user directly provides the advantage instead of the reward (otherwise we would need an all-gather to compute the reward mean/std).
  2. Added clipping.
  3. Not supported yet.
  4. Added an option for the old policy prob.

Also added fixes for loss correctness, metrics calculation, testing, etc.
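
To make the all-gather remark in item 1 concrete: a group-normalized advantage needs the rewards of every completion sampled for the same prompt, so if those completions are spread across ranks, the mean/std cannot be computed locally inside the loss. A minimal sketch of computing the advantage in the caller instead is shown below; `group_advantages` and `eps` are hypothetical names, and the use of the sample standard deviation is an assumption rather than something confirmed by this thread.

```python
import statistics


def group_advantages(rewards, eps=1e-4):
    """Normalize the rewards of one prompt's group of completions.

    Hypothetical helper: `rewards` must already contain every completion
    in the group (i.e. after any cross-rank gather). Needs >= 2 rewards.
    """
    mean = statistics.fmean(rewards)
    std = statistics.stdev(rewards)  # sample std; an assumption of this sketch
    # eps keeps the division finite when all rewards in the group are equal.
    return [(r - mean) / (std + eps) for r in rewards]


advantages = group_advantages([1.0, 2.0, 3.0])
print(advantages)
```

The resulting list can then be passed to the loss as precomputed advantages, which keeps the loss itself free of any collective communication.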

shivam15s avatar Mar 26 '25 15:03 shivam15s