
Persistent Variance in IMDB

Open mnoukhov opened this issue 2 years ago • 1 comment

While running experiments on IMDB, I found very high variance in the validation and test set results that I don't fully understand, so I'm looking for some advice.

Here, I've run PPO for 10 seeds using the default hyperparameters.

[image: validation/test metrics over training for 10 PPO seeds]
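For reference, here is roughly how such a 10-seed sweep could be launched. This is a sketch: the config path, the `--experiment_name` flag, and the location of the `seed` key in the YAML are from memory and may differ in your checkout, so verify them before running.

```python
# Hypothetical multi-seed sweep over RL4LMs' training script.
# Assumptions: the script/config paths below exist in your checkout and
# the alg args accept a "seed" key; verify both before running.
import subprocess
import yaml

CONFIG = "scripts/training/task_configs/imdb_text_continuation/gpt2_ppo.yml"

for seed in range(10):
    with open(CONFIG) as f:
        cfg = yaml.safe_load(f)
    cfg["alg"]["args"]["seed"] = seed  # assumed seed location in the config
    tmp = f"/tmp/gpt2_ppo_seed{seed}.yml"
    with open(tmp, "w") as f:
        yaml.safe_dump(cfg, f)
    subprocess.run(
        ["python", "scripts/training/train_text_generation.py",
         "--config_path", tmp,
         "--experiment_name", f"imdb_ppo_seed{seed}"],
        check=True,
    )
```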

First of all, it's clear that

  1. there is a large variance in performance at epoch 0, which could be explained by randomness in the eval sampling during decoding (see the sketch after this list)
  2. there is a large variance in performance at epoch 50, which could be explained by randomness in RL training
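To make point 1 concrete, here's a minimal sketch (not RL4LMs code; the model name and decoding parameters are just for illustration) of how sampled decoding alone yields different generations, and hence different eval metrics, under different seeds:

```python
# Minimal sketch: sampled decoding is seed-dependent, so eval metrics at
# epoch 0 can differ across runs even with identical model weights.
# The model name and decoding parameters are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("lvwerra/gpt2-imdb")
model = AutoModelForCausalLM.from_pretrained("lvwerra/gpt2-imdb")

inputs = tokenizer("The movie was", return_tensors="pt")

for seed in range(3):
    torch.manual_seed(seed)  # a different eval seed gives different samples
    out = model.generate(**inputs, do_sample=True, top_k=50, max_new_tokens=20)
    print(seed, tokenizer.decode(out[0], skip_special_tokens=True))
```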

But taken together, we see that runs that perform best at epoch 0 generally also perform best on perplexity at epoch 50, which I can't explain. Here are the top 5 and bottom 5 runs by initial perplexity, plotted against each other:

[image: perplexity curves for the top 5 vs. bottom 5 runs by initial perplexity]

Given that all models are initialized from the same pretrained checkpoint, there should be no randomness in initialization, so I'm confused as to how this is possible. A lucky random seed for the initial validation should not affect the random seed for RL training, so why does the model that performs best at epoch 0 generally also perform best at epoch 50?

Finally, I think the variance in results is high enough that I would recommend using 10 seeds for RL4LMs experiments.

mnoukhov · Feb 02 '23 20:02

  • Each run will have some randomness in dataset creation: because the original dataset is large, we randomly select the val and test samples, so every run evaluates on a different subset (see the sketch after this list).
  • Additionally, during decoding there is randomness due to the sampling of tokens (both at epoch 0 and at epoch 50).
  • There is also randomness in PPO's episode generation.
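The first point would also explain the correlation you observe: if each run draws its own val subset, a run whose subset happens to be easy looks good at both epoch 0 and epoch 50, even when training noise is independent across runs. Here's a toy sketch of the effect (all numbers are made up for illustration):

```python
# Toy sketch: per-run val subsets induce correlated eval scores across
# epochs, even when training noise is independent across runs.
import numpy as np

rng = np.random.default_rng(0)
difficulty = rng.normal(size=10_000)  # fixed per-example "difficulty"

epoch0, epoch50 = [], []
for run in range(10):
    run_rng = np.random.default_rng(run)
    val_idx = run_rng.choice(len(difficulty), size=100, replace=False)
    subset_bias = difficulty[val_idx].mean()  # easy/hard subset for this run
    train_noise = run_rng.normal(scale=0.05)  # independent RL noise
    epoch0.append(subset_bias)
    epoch50.append(subset_bias + train_noise)

print("corr(epoch0, epoch50) =", np.corrcoef(epoch0, epoch50)[0, 1])
```

Because the subset bias persists across epochs while the training noise does not, the printed correlation is high.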

Also, can you share the exact mean and standard deviation of these runs and the corresponding config? We can double-check from our side too.
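For reporting, something like this is enough (a sketch with placeholder numbers, assuming the per-seed final perplexities have been collected):

```python
# Aggregate per-seed results into mean ± sd (values are placeholders).
import numpy as np

perplexities = np.array([33.1, 34.8, 32.5, 35.2, 33.9,
                         34.1, 32.9, 35.6, 33.4, 34.5])
print(f"perplexity: {perplexities.mean():.2f} ± {perplexities.std(ddof=1):.2f}")
```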

rajcscw · Feb 03 '23 22:02