PPO Questions
I'm comparing the PPO implementation here to the OpenAI one and to the implementation-details blog post that walks through it. I'm wondering whether some of these differences improve performance; if not, they're still good for understanding.
I'm guessing the discrepancy comes from the original work vs. the Learning to Summarize work, which is interesting.
Some things to confirm:
- [ ] PPO update question: I was a little confused seeing `returns = advantages + values` (l693) instead of `adv = returns - values`. Why did it end up like that? (See the sketch after this list.)
- [x] Some implementations use a residual value prediction in the clipping; compare with what TRL does.
- [x] Consider the approximate KL estimator used in TRLX and discussed on John Schulman's blog.
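On the first item, here's a minimal sketch (my own, not the TRL source) of how GAE-based PPO code typically ends up with `returns = advantages + values`: the advantages are computed first via the GAE recursion, and the value target (the λ-return) is then recovered by adding the baseline back, so the two formulations describe the same quantity in a different order. Names like `rewards`, `values`, `gamma`, `lam` are just illustrative.

```python
import torch

def gae_advantages_and_returns(rewards, values, gamma=0.99, lam=0.95):
    """rewards, values: 1-D tensors over a single trajectory of length T."""
    T = rewards.shape[0]
    advantages = torch.zeros_like(rewards)
    last_gae = 0.0
    for t in reversed(range(T)):
        next_value = values[t + 1] if t + 1 < T else 0.0
        # TD residual: r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * next_value - values[t]
        # GAE recursion: A_t = delta_t + gamma * lam * A_{t+1}
        last_gae = delta + gamma * lam * last_gae
        advantages[t] = last_gae
    # The value-function target (the lambda-return) is advantage + baseline,
    # which is why code often writes `returns = advantages + values`
    # rather than computing returns first and then `adv = returns - values`.
    returns = advantages + values
    return advantages, returns
```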
It could be good to make things like this configurable in a branch and learn how these implementation details transfer to RLHF.
imo, residual clipping seems beneficial for preventing the policy loss spiking reported in #101. It's probably coming from instability in the value estimation.
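For reference, a rough sketch of the clipped ("residual") value loss used in some PPO implementations, to compare against a plain squared-error value loss. The names (`values_pred`, `values_old`, `returns`, `cliprange_value`) are illustrative assumptions, not TRL's API.

```python
import torch

def clipped_value_loss(values_pred, values_old, returns, cliprange_value=0.2):
    # Clip the new value prediction to a band around the old (rollout-time) one.
    values_clipped = values_old + (values_pred - values_old).clamp(
        -cliprange_value, cliprange_value
    )
    loss_unclipped = (values_pred - returns) ** 2
    loss_clipped = (values_clipped - returns) ** 2
    # Pessimistic (elementwise max) combination, mirroring the clipped policy objective.
    return 0.5 * torch.max(loss_unclipped, loss_clipped).mean()
```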
Yeah, I'm running residual clipping example(s); we'll see. At least it'll be good to have the option to try both.
Residual value prediction didn't help with stability (it's the crimson-wish run).
The other approx KL formulation also wasn't a big help (W&B here), though it's slightly more stable?
We'll see how this run finishes converging.
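For reference, a sketch of the two estimators from John Schulman's "Approximating KL Divergence" post that are being compared here: the naive `-log r` and the `(r - 1) - log r` form. The per-token log-prob names below are assumptions, not TRL's API.

```python
import torch

def kl_estimates(logprobs, ref_logprobs):
    """Per-token log-probs under the current policy and the reference model,
    for samples drawn from the current policy."""
    log_ratio = ref_logprobs - logprobs       # log r, with r = p_ref(x) / p(x)
    k1 = -log_ratio                           # naive estimator of KL(policy || ref)
    k3 = torch.expm1(log_ratio) - log_ratio   # (r - 1) - log r: lower variance,
                                              # always non-negative
    return k1.mean(), k3.mean()
```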
Closing this for now, feel free to reopen if there's an update.