RL4LMs
Reproducing IMDB results
Hi, I'm currently running the IMDB experiments and trying to reproduce the PPO and NLPO results from the paper. My PPO results are close, but NLPO is quite far from the reported numbers. Do you have any advice for reproducing the NLPO results?
I'm running the default configs (`scripts/training/task_configs/imdb_text_continuation/gpt2_{ppo,nlpo}.yml`), and the final test results compared to the paper's are below.
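For completeness, this is the launch command I'm using, based on the repo's README (the `--config_path` flag is the only option I pass; everything else is left at the config defaults):

```shell
# Launch NLPO training with the default IMDB text-continuation config.
# Swap gpt2_nlpo.yml for gpt2_ppo.yml to run the PPO experiment.
python scripts/training/train_text_generation.py \
  --config_path scripts/training/task_configs/imdb_text_continuation/gpt2_nlpo.yml
```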
| | Sentiment Score | Fluency (Perplexity) |
|---|---|---|
| zero-shot (ppo) | 0.486 | 32.4 |
| ppo | 0.604 | 33.0 |
| zero-shot (nlpo) | 0.497 | 32.7 |
| nlpo | 0.496 | 40.8 |
| paper's zero-shot | 0.489 | 32.2 |
| paper's ppo | 0.605 | 33.5 |
| paper's nlpo | 0.637 | 32.7 |
The PPO results are similar (with even slightly lower perplexity), but NLPO is not at all close. Here are the validation curves:
NLPO's sentiment score also improves for a while, then suddenly plateaus and decreases, while perplexity keeps rising the whole time. Comparing the training curves, the approximate KL loss is much larger for NLPO, though that may be expected given NLPO's modifications. Do you see similar curves?
Finally, the paper's Appendix Table 4 says training runs for 10 epochs, but Figure 4 just below it (and the wandb logging) shows these experiments running for 50 epochs. Should I be running for 10 or 50 epochs?
Each experiment is being run on 4 A100 GPUs, as per #12.