Liger-Kernel Generalized PPO loss (& improve current GRPO loss)

Open qingquansong opened this issue 8 months ago • 4 comments

[ ] 1. The current GRPO loss assumes the KL term is always added and the advantage is computed inside the loss; we want to make both configurable by the user.
[ ] 2. Clipping is not used in the GRPO loss and needs to be added (also as an option for the generalized PPO case).
[ ] 3. As an option, reference model logits/probs can be provided directly, without providing the LM head + hidden states.
[ ] 4. The old policy probs (or LM head/hidden states) are not provided to the GRPO loss for importance sampling.
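
For reference, the four items above can be sketched as a single per-token loss with optional pieces. This is only a minimal illustration in plain Python, not the Liger-Kernel or TRL API: the function name `ppo_token_loss` and the parameters `clip_eps`, `beta`, `old_logp`, and `ref_logp` are hypothetical, and the KL term uses the k3 estimator, which is an assumption about what the loss should compute.

```python
import math
from typing import Optional


def ppo_token_loss(
    logp: float,                      # log-prob of the sampled token under the current policy
    advantage: float,                 # supplied by the caller, not computed here (item 1)
    old_logp: Optional[float] = None, # old-policy log-prob for importance sampling (item 4)
    ref_logp: Optional[float] = None, # reference log-prob passed in directly (item 3)
    beta: float = 0.0,                # KL weight; the KL term is skipped when 0 (item 1)
    clip_eps: Optional[float] = 0.2,  # None disables clipping (item 2)
) -> float:
    # Without old-policy log-probs the ratio is 1 in value. In an autograd
    # framework one would use a detached copy of logp here instead.
    old = logp if old_logp is None else old_logp
    ratio = math.exp(logp - old)

    surrogate = ratio * advantage
    if clip_eps is not None:
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Pessimistic (lower) bound of the clipped and unclipped objectives.
        surrogate = min(surrogate, clipped * advantage)

    loss = -surrogate
    if beta != 0.0 and ref_logp is not None:
        # k3 estimator of KL(policy || reference); always non-negative.
        delta = ref_logp - logp
        loss += beta * (math.exp(delta) - delta - 1.0)
    return loss
```

With the defaults (`beta=0.0`, no `old_logp`) this reduces to the plain policy-gradient term `-advantage`; passing `old_logp` and `clip_eps` gives the clipped PPO surrogate, and a non-zero `beta` with `ref_logp` adds the KL penalty.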

Mar 24 '25 21:03 qingquansong

Assigned to me for now; if anyone is interested in any of the features, feel free to take some.

Mar 24 '25 21:03 qingquansong

Hi @qingquansong, I'd like to pick up item 1 first. If I finish it early, I can take more.

Mar 24 '25 22:03 mRSun15

Yes, the implementation in Liger is older, so it would need updating to match the newer version in TRL.

Mar 25 '25 13:03 kashif

I worked on #628; I didn't see this issue before, so I was working on it independently. In my implementation:

1. KL is only calculated if beta is non-zero. The user is assumed to provide the advantage directly instead of the reward (otherwise we would need an all-gather to compute the reward mean/std).
2. Added clipping.
3. Not supported yet.
4. Added an option for old policy probs.

Also added fixes for loss correctness, metrics calculation, testing, etc.
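
To illustrate what "the user provides the advantage directly" could look like on the caller side, here is a rough sketch of group-normalized advantages computed from rewards before they reach the loss. It is not taken from #628: the helper name `group_advantages`, the `eps` value, and the use of the population standard deviation are all assumptions, and it presumes every reward of a group is already on one process (which is exactly what an all-gather would otherwise have to arrange).

```python
from statistics import fmean, pstdev
from typing import List, Sequence


def group_advantages(rewards: Sequence[float], eps: float = 1e-4) -> List[float]:
    """Normalize the rewards of one prompt's completions to zero mean, unit scale."""
    mean = fmean(rewards)
    std = pstdev(rewards)  # population std; a sample std would be an equally valid choice
    # eps keeps the division finite when all rewards in the group are equal.
    return [(r - mean) / (std + eps) for r in rewards]


# One group of completions for the same prompt.
advantages = group_advantages([0.0, 2.0, 1.0, 1.0])
```

The resulting list can then be passed to the loss as-is, so the loss itself never needs to see the raw rewards.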

Mar 26 '25 15:03 shivam15s