PPO Questions
I'm comparing the PPO implementation here to the OpenAI one and to the implementation-details blog post that walks through it. I'm wondering whether some of these differences improve performance; if not, they're still good for understanding.
I'm guessing the discrepancy comes from the original work vs. the Learning to Summarize work, which is interesting.
Some things to confirm:
- [ ] PPO update question: I was a little confused seeing `returns = advantages + values` (l693) instead of `adv = returns - values`. Why did it end up like that? (See the sketch after this list.)
- [x] Some implementations use a residual value prediction in the clipping; compare with what TRL does.
- [x] Consider the approximate KL estimator used in TRLX and discussed on John Schulman's blog.
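On the first item, here's a minimal sketch (my own, not the TRL source) of how GAE-based PPO code typically ends up with `returns = advantages + values`: the advantages are computed first via the GAE recursion, and the value target (the λ-return) is then recovered by adding the baseline back, so the two formulations describe the same quantity in a different order. Names like `rewards`, `values`, `gamma`, `lam` are just illustrative.

```python
import torch

def gae_advantages_and_returns(rewards, values, gamma=0.99, lam=0.95):
    """rewards, values: 1-D tensors over a single trajectory of length T."""
    T = rewards.shape[0]
    advantages = torch.zeros_like(rewards)
    last_gae = 0.0
    for t in reversed(range(T)):
        next_value = values[t + 1] if t + 1 < T else 0.0
        # TD residual: r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * next_value - values[t]
        # GAE recursion: A_t = delta_t + gamma * lam * A_{t+1}
        last_gae = delta + gamma * lam * last_gae
        advantages[t] = last_gae
    # The value-function target (the lambda-return) is advantage + baseline,
    # which is why code often writes `returns = advantages + values`
    # rather than computing returns first and then `adv = returns - values`.
    returns = advantages + values
    return advantages, returns
```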
It could be good to make things like this configurable in a branch and learn how these implementation details transfer to RLHF.
imo, residual clipping seems beneficial for preventing the policy loss spiking reported in #101. It's probably coming from instability in the value estimation.
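For reference, a rough sketch of the clipped ("residual") value loss used in some PPO implementations, to compare against a plain squared-error value loss. The names (`values_pred`, `values_old`, `returns`, `cliprange_value`) are illustrative assumptions, not TRL's API.

```python
import torch

def clipped_value_loss(values_pred, values_old, returns, cliprange_value=0.2):
    # Clip the new value prediction to a band around the old (rollout-time) one.
    values_clipped = values_old + (values_pred - values_old).clamp(
        -cliprange_value, cliprange_value
    )
    loss_unclipped = (values_pred - returns) ** 2
    loss_clipped = (values_clipped - returns) ** 2
    # Pessimistic (elementwise max) combination, mirroring the clipped policy objective.
    return 0.5 * torch.max(loss_unclipped, loss_clipped).mean()
```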
Yeah, I'm running residual clipping example(s); we'll see. At least it'll be good to have the option to try both.
Residual value prediction didn't help with stability (it's the crimson-wish run).
The other approx KL formulation also wasn't a big help (W&B here), though it's slightly more stable?
We'll see how this run finishes converging.
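For reference, a sketch of the two estimators from John Schulman's "Approximating KL Divergence" post that are being compared here: the naive `-log r` and the `(r - 1) - log r` form. The per-token log-prob names below are assumptions, not TRL's API.

```python
import torch

def kl_estimates(logprobs, ref_logprobs):
    """Per-token log-probs under the current policy and the reference model,
    for samples drawn from the current policy."""
    log_ratio = ref_logprobs - logprobs       # log r, with r = p_ref(x) / p(x)
    k1 = -log_ratio                           # naive estimator of KL(policy || ref)
    k3 = torch.expm1(log_ratio) - log_ratio   # (r - 1) - log r: lower variance,
                                              # always non-negative
    return k1.mean(), k3.mean()
```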
Closing this for now, feel free to reopen if there's an update.