TinyLlama

Why is there a significant drop in `val_ppl` after fixing the data-loading bug?

Open bobqianic opened this issue 2 years ago • 7 comments

[image: training curves — train loss continues unchanged after the fix, while val loss and val_ppl drop sharply]

bobqianic commented Oct 25 '23

Hmm, that's really interesting.

Thinking about it, repeated data could have distorted the data distribution, making the model see some samples more often than it should have, which could have affected val performance.

From the graphs, train loss is exactly the same, on track, yet val shows a sudden drop in ppl and loss. The data bugfix was important and improved performance, but I'd expect it to have affected the train values too. Maybe we're observing some form of grokking?

@jzhang38 do you have an explanation for this?

VatsaDev commented Oct 26 '23

We sample 100 iters of val data for the actual validation. I believe this is caused by the fact that a different partition of val data gets sampled after the dataloader fix.
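
For reference, this is roughly what that validation step looks like in a lit-gpt-style training loop; the names below are illustrative rather than the exact TinyLlama code. The point is that the reported loss is averaged over only 100 val batches, so which partition of val data the dataloader yields directly moves `val_ppl`:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def validate(model, val_dataloader, eval_iters: int = 100):
    # Average the loss over only `eval_iters` val batches; a dataloader fix
    # that changes which partition of the val set gets yielded changes this number.
    model.eval()
    losses = torch.zeros(eval_iters)
    for k, (input_ids, targets) in enumerate(val_dataloader):
        if k >= eval_iters:
            break
        logits = model(input_ids)
        losses[k] = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
    model.train()
    val_loss = losses.mean()
    return val_loss, torch.exp(val_loss)  # val_ppl = exp(mean val loss)
```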

jzhang38 commented Oct 31 '23

Different sampling alone should not drop the loss by that much. I believe this needs a more thorough investigation into why it is happening.

But maybe it has something to do with the fixed loader now picking Starcoder vs. SlimPajama data, or the other way around.
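
TinyLlama mixes the SlimPajama and Starcoder streams at fixed sample weights, so a loader bug can silently skew the effective mixture. Here is a minimal sketch of that kind of weighted mixing; the ~0.7/0.3 split and all names are assumptions for illustration, not the repo's actual code:

```python
import random

def mixed_batches(slimpajama_iter, starcoder_iter, p_slimpajama=0.7, seed=42):
    # Draw each batch from one of the two streams with a fixed probability.
    # A dataloader bug that skews this choice (or silently drops one stream's
    # shards) means the model trains and validates on a different mixture.
    rng = random.Random(seed)
    while True:
        yield next(slimpajama_iter if rng.random() < p_slimpajama else starcoder_iter)
```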

psinger commented Oct 31 '23

If dataset processing was repeated between the runs, it's possible the train/val split is different. If any of the data in the validation set was previously in the training set, that would account for the sharp discontinuity more neatly than a different sample of val (a random subset of 100 batches should really be large enough that any sample yields almost the same number).

If this is correct, it doesn't mean that the val_ppl trajectory is a bad metric, but it does mean that the set isn't truly a validation set any more.
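
One way to test this directly would be to hash the tokenized sequences on both sides of the split and intersect them; a self-contained sketch with hypothetical names:

```python
import hashlib

def count_leaked(train_seqs, val_seqs):
    """Count val sequences whose token content also appears in train."""
    def hashes(seqs):
        # Hash each tokenized sequence so overlap can be checked cheaply.
        return {hashlib.sha256(str(list(s)).encode("utf-8")).hexdigest() for s in seqs}
    return len(hashes(train_seqs) & hashes(val_seqs))

# Toy example: the shared sequence [1, 2, 3] is flagged as leaked.
print(count_leaked([[1, 2, 3], [4, 5, 6]], [[1, 2, 3], [7, 8, 9]]))  # -> 1
```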

segyges commented Oct 31 '23

Looks like it's been fixed with train run 2? Was the previous val data something the model might have trained on? What's the deduplication status of the data?

VatsaDev commented Oct 31 '23

It also seems odd that the fix had no visible effect on train_loss. Allegedly, 35% (or perhaps even 60%) of the training data was not being used previously. How can introducing that much new data have no effect on train_loss, especially early after the restart? That's a huge distribution shift.

eminorhan commented Nov 01 '23

@eminorhan I had similar thoughts, but it's also possible that the 35% of new data is more uniform: the SlimPajama portion might be more uniform, or it could have been code data, which also has a uniform style. It's strange that the val PPL is still dropping with the extra 30%, yet benchmarks are growing consistently with previous versions. Is the val check contaminated?

VatsaDev commented Nov 07 '23