Liger-Kernel
Generalized PPO loss (& improve current GRPO loss)
🚀 The feature, motivation and pitch
- [ ] 1. The current GRPO loss assumes the KL term is always added and the advantage is computed inside the loss; we want to make both configurable by the user.
- [ ] 2. Clipping is not used in the GRPO loss and needs to be added (also as an option for the generalized PPO case).
- [ ] 3. Allow reference model logits/probs to be provided directly, as an alternative to providing the LM head + hidden states.
- [ ] 4. Old-policy probs (or LM head/hidden states) cannot currently be provided for importance sampling in GRPO.
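For illustration, the four items above could fit together roughly as in the plain-Python sketch below, which computes a per-token clipped loss where the KL term, the clipping range, and the old-policy log-probs are all optional. The function name `ppo_token_loss` and its parameters (`old_logp`, `ref_logp`, `beta`, `eps_low`, `eps_high`) are hypothetical and not Liger-Kernel's actual API. The sketch assumes the advantage is computed by the caller and that reference log-probs are passed in directly.

```python
import math


def ppo_token_loss(logp, advantage, old_logp=None, ref_logp=None,
                   beta=0.0, eps_low=0.2, eps_high=0.2):
    """Per-token clipped surrogate loss with an optional KL penalty.

    Hypothetical reference sketch; names and defaults are assumptions.
    """
    # Without old-policy log-probs, treat the sample as on-policy (ratio == 1).
    # In an autograd framework this would be a detached copy of `logp`.
    if old_logp is None:
        old_logp = logp
    ratio = math.exp(logp - old_logp)

    # Clip the importance ratio to [1 - eps_low, 1 + eps_high].
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)

    # PPO keeps the pessimistic (smaller) of the two surrogate objectives.
    surrogate = min(ratio * advantage, clipped * advantage)
    loss = -surrogate

    # The KL term is only computed when beta is non-zero.
    if beta != 0.0:
        if ref_logp is None:
            raise ValueError("ref_logp is required when beta != 0")
        delta = ref_logp - logp
        # Non-negative KL estimator: exp(delta) - delta - 1.
        loss += beta * (math.exp(delta) - delta - 1.0)
    return loss
```

With `beta=0.0` and no `old_logp`, this reduces to `-advantage`, so the KL term and the importance ratio can each be switched on independently.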
Alternatives
No response
Additional context
No response
Assigned to me for now; if anyone is interested in any of the features, feel free to take some.
Hi @qingquansong, I'd like to pick up item 1 first; if I finish it early, I can take more.
Yes, the implementation in Liger is older, so it would need updating to match the newer version in TRL.
Worked on #628. I didn't see this issue before, so I was working on this independently. In my implementation:
- Item 1: KL is only calculated if beta is non-zero. The user is assumed to provide the advantage directly instead of the reward (otherwise we would need an all-gather to compute the reward mean/std).
- Item 2: Added clipping.
- Item 3: Not supported yet.
- Item 4: Added an option for old policy probs.

Also added fixes for loss correctness, metrics calculation, testing, etc.
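If the advantage is supplied by the caller, as assumed in this comment, it has to be computed from the rewards before the loss is called. A minimal sketch of group-normalized advantages for one prompt's completions might look like the following. The helper name `group_advantages`, the `eps` value, and the use of the population standard deviation are all assumptions for illustration. In a distributed run the rewards would first need to be gathered across ranks.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-4):
    """Normalize rewards within one group of completions for the same prompt.

    Hypothetical helper; `eps` guards against a zero standard deviation.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Example: four completions for one prompt, two rewarded and two not.
advantages = group_advantages([1.0, 0.0, 1.0, 0.0])
```

Completions with above-average reward get a positive advantage and the rest a negative one. A group with identical rewards yields all zeros.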