
PPO Questions

Open · natolambert opened this issue on Jan 28, 2023 · 5 comments

I'm comparing the PPO implementation to the OpenAI one and the implementation-details blog post that goes through it. I'm wondering whether some of these things improve performance; if not, they're still good for understanding.

I'm guessing the discrepancy comes from the original PPO work vs. the Learning to Summarize work, which is interesting.

Some things to confirm:

  • [ ] PPO update question: I was a little confused seeing `returns = advantages + values` (line 693) instead of `advantages = returns - values`. Why did it end up like that? (See the GAE sketch after this list.)
  • [x] Some implementations use a residual value prediction in clipping, unlike TRL (see the clipped value loss sketch further below).
  • [x] Consider the approximate KL estimator used in TRLX and discussed on John Schulman's blog (sketch further below).
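
For context on the first point: in ppo2-style implementations, advantages are computed directly with GAE from the per-step rewards and value predictions, and the returns used as the value-function target then fall out as `advantages + values`; so `returns = advantages + values` and `advantages = returns - values` are the same identity read in different directions. A minimal sketch of that ordering (function and variable names are illustrative, not TRL's exact code):

```python
import torch

def gae_advantages_and_returns(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation over a single trajectory.

    rewards, values: 1-D float tensors of length T.
    Returns (advantages, returns), where returns = advantages + values
    is the value-function target.
    """
    T = rewards.shape[0]
    advantages = torch.zeros_like(rewards)
    last_gae = 0.0
    for t in reversed(range(T)):
        next_value = values[t + 1] if t < T - 1 else 0.0
        delta = rewards[t] + gamma * next_value - values[t]  # TD error
        last_gae = delta + gamma * lam * last_gae
        advantages[t] = last_gae
    # Each advantage estimates (return_t - values[t]), so the value target
    # is recovered by adding the baseline back in:
    returns = advantages + values
    return advantages, returns
```

One practical upside of this ordering is that the advantages can be normalized for the policy loss while the raw `returns` stay intact as value targets.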

natolambert commented on Jan 28, 2023

It could be good to make things like this configurable in a branch and learn how these implementation details transfer to RLHF.

natolambert commented on Jan 28, 2023

IMO, residual clipping seems beneficial for preventing the policy loss spikes reported in #101. They're probably coming from instability in the value estimation.
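
For reference, the residual (clipped) value loss from openai/baselines' ppo2 keeps the new value prediction within a clip range of the rollout-time prediction and takes the pessimistic maximum of the two losses. A minimal sketch, with illustrative names (not TRL's actual API):

```python
import torch

def clipped_value_loss(values, old_values, returns, cliprange_value=0.2):
    """PPO-style clipped value loss (ppo2-style sketch).

    values:     value predictions from the current network
    old_values: value predictions recorded at rollout time
    returns:    GAE-based value targets (advantages + old_values)
    """
    # Clip the new prediction so it stays within cliprange_value
    # of the rollout-time prediction.
    values_clipped = old_values + torch.clamp(
        values - old_values, -cliprange_value, cliprange_value
    )
    loss_unclipped = (values - returns) ** 2
    loss_clipped = (values_clipped - returns) ** 2
    # Pessimistic bound: take the larger loss element-wise.
    return 0.5 * torch.mean(torch.max(loss_unclipped, loss_clipped))
```

The intuition matches the comment above: bounding how far the value head can move per update gives value-estimation errors less room to destabilize the advantages that feed the policy loss.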

DaehanKim commented on Jan 28, 2023

Yeah, I'm running residual clipping example(s); we'll see. At least it'll be good to have the option to try both.

natolambert commented on Jan 30, 2023

Residual value prediction didn't help with stability (it's the crimson-wish run).
[Screenshot: W&B training curves]

natolambert commented on Jan 30, 2023

The other approximate-KL formulation also wasn't a big help (W&B here). Though, it's slightly more stable?
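
For reference, Schulman's "Approximating KL Divergence" post suggests `(r - 1) - log r` (his k3) as an unbiased, low-variance estimator of the KL, where `r` is the probability ratio of the reference model over the sampling policy. A minimal sketch of applying it per token, with illustrative names:

```python
import torch

def approx_kl(logprobs, ref_logprobs):
    """k3 estimator of KL(policy || ref) from policy samples.

    logprobs:     log-probs of sampled tokens under the current policy
    ref_logprobs: log-probs of the same tokens under the reference model
    """
    # log of the ratio ref/policy for tokens sampled from the policy
    logr = ref_logprobs - logprobs
    # Naive estimator is -logr; (r - 1) - log r is unbiased with
    # lower variance and is non-negative for every sample.
    return torch.mean(torch.expm1(logr) - logr)
```

Because each per-token term is non-negative, the penalty can't go negative on individual tokens the way the raw `logprobs - ref_logprobs` difference can, which may be where the slight stability difference comes from.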

We'll see how this run finishes converging.
[Screenshot: W&B training curves]

natolambert commented on Jan 30, 2023

Closing this for now, feel free to reopen if there's an update.

lvwerra commented on Jun 1, 2023