
Persistent Variance in IMDB

Open mnoukhov opened this issue 2 years ago • 1 comment

While running experiments on IMDB, I found very high variance in the validation and test set results that I don't fully understand, so I'm looking for some advice.

Here, I've run PPO for 10 seeds using the default hyperparameters.

[image: validation/test metrics over training for 10 PPO seeds]
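For reference, here is roughly how such a 10-seed sweep could be launched. This is a sketch: the config path, the `--experiment_name` flag, and the location of the `seed` key in the YAML are from memory and may differ in your checkout, so verify them before running.

```python
# Hypothetical multi-seed sweep over RL4LMs' training script.
# Assumptions: the script/config paths below exist in your checkout and
# the alg args accept a "seed" key; verify both before running.
import subprocess
import yaml

CONFIG = "scripts/training/task_configs/imdb_text_continuation/gpt2_ppo.yml"

for seed in range(10):
    with open(CONFIG) as f:
        cfg = yaml.safe_load(f)
    cfg["alg"]["args"]["seed"] = seed  # assumed seed location in the config
    tmp = f"/tmp/gpt2_ppo_seed{seed}.yml"
    with open(tmp, "w") as f:
        yaml.safe_dump(cfg, f)
    subprocess.run(
        ["python", "scripts/training/train_text_generation.py",
         "--config_path", tmp,
         "--experiment_name", f"imdb_ppo_seed{seed}"],
        check=True,
    )
```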

First of all, it's clear that

  1. there is a large variance in performance at epoch 0, which could be explained by randomness in the eval sampling during decoding (see the sketch after this list)
  2. there is a large variance in performance at epoch 50, which could be explained by randomness in RL training
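To make point 1 concrete, here's a minimal sketch (not RL4LMs code; the model name and decoding parameters are just for illustration) of how sampled decoding alone yields different generations, and hence different eval metrics, under different seeds:

```python
# Minimal sketch: sampled decoding is seed-dependent, so eval metrics at
# epoch 0 can differ across runs even with identical model weights.
# The model name and decoding parameters are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("lvwerra/gpt2-imdb")
model = AutoModelForCausalLM.from_pretrained("lvwerra/gpt2-imdb")

inputs = tokenizer("The movie was", return_tensors="pt")

for seed in range(3):
    torch.manual_seed(seed)  # a different eval seed gives different samples
    out = model.generate(**inputs, do_sample=True, top_k=50, max_new_tokens=20)
    print(seed, tokenizer.decode(out[0], skip_special_tokens=True))
```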

But taken together, we see that runs that perform best at epoch 0 generally also perform best on perplexity at epoch 50, which I can't explain. Here are the top 5 and bottom 5 runs by initial perplexity, plotted against each other:

[image: perplexity curves for the top 5 vs. bottom 5 runs by initial perplexity]

Given that all models are initialized from the same pretrained checkpoint, there should be no randomness in initialization, so I'm confused as to how this is possible. A lucky random seed for the initial validation should not affect the random seed for RL training, so why does the model that performs best at epoch 0 generally also perform best at epoch 50?

Finally, I think the variance in results is high enough that I would recommend using 10 seeds for RL4LMs experiments.

mnoukhov · Feb 02 '23 20:02

  • Each run will have some randomness in dataset creation: because the original dataset is large, we randomly select the val and test samples, so every run evaluates on a different subset (see the sketch after this list).
  • Additionally, during decoding there is randomness due to the sampling of tokens (both at epoch 0 and at epoch 50).
  • There is also randomness in PPO's episode generation.
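The first point would also explain the correlation you observe: if each run draws its own val subset, a run whose subset happens to be easy looks good at both epoch 0 and epoch 50, even when training noise is independent across runs. Here's a toy sketch of the effect (all numbers are made up for illustration):

```python
# Toy sketch: per-run val subsets induce correlated eval scores across
# epochs, even when training noise is independent across runs.
import numpy as np

rng = np.random.default_rng(0)
difficulty = rng.normal(size=10_000)  # fixed per-example "difficulty"

epoch0, epoch50 = [], []
for run in range(10):
    run_rng = np.random.default_rng(run)
    val_idx = run_rng.choice(len(difficulty), size=100, replace=False)
    subset_bias = difficulty[val_idx].mean()  # easy/hard subset for this run
    train_noise = run_rng.normal(scale=0.05)  # independent RL noise
    epoch0.append(subset_bias)
    epoch50.append(subset_bias + train_noise)

print("corr(epoch0, epoch50) =", np.corrcoef(epoch0, epoch50)[0, 1])
```

Because the subset bias persists across epochs while the training noise does not, the printed correlation is high.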

Also, can you share the exact mean and standard deviation of these runs and the corresponding config? We can double-check from our side too.
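For reporting, something like this is enough (a sketch with placeholder numbers, assuming the per-seed final perplexities have been collected):

```python
# Aggregate per-seed results into mean ± sd (values are placeholders).
import numpy as np

perplexities = np.array([33.1, 34.8, 32.5, 35.2, 33.9,
                         34.1, 32.9, 35.6, 33.4, 34.5])
print(f"perplexity: {perplexities.mean():.2f} ± {perplexities.std(ddof=1):.2f}")
```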

rajcscw · Feb 03 '23 22:02