Megatron-LM

Incorrect shuffling of documents across epochs in GPTDataset

Open argitrage opened this issue 1 year ago • 1 comment

Incorrect Dataset Shuffling

  • Currently, in gpt_dataset.py, the dataset is shuffled globally across epochs rather than within each epoch, which is the standard approach (see the sketch after this list).
  • Both the shuffle index and the document index are built with this cross-epoch shuffling.
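
For concreteness, here is a minimal sketch of the cross-epoch behavior described above (the function name and signature are illustrative, not Megatron's actual code): the document order for every epoch is laid out first, and then one permutation is applied to the whole block.

```python
import numpy as np

def build_document_index_global(num_docs: int, num_epochs: int,
                                rng: np.random.Generator) -> np.ndarray:
    # Concatenate the document order for every epoch, then apply a single
    # permutation to the whole block, so documents from different epochs mix.
    doc_idx = np.tile(np.arange(num_docs), num_epochs)
    rng.shuffle(doc_idx)
    return doc_idx
```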

Question: Has this been done on purpose? Is there any reason to prefer global shuffling over per-epoch shuffling?

Solution: Shuffle the data within each epoch instead of shuffling the full dataset. The implementation is straightforward, but both the document index and the shuffle index need to be fixed to resolve the overall problem.
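
A minimal sketch of the proposed per-epoch fix, under the same illustrative names as the sketch above (again, not Megatron's actual function):

```python
import numpy as np

def build_document_index_per_epoch(num_docs: int, num_epochs: int,
                                   rng: np.random.Generator) -> np.ndarray:
    # Draw an independent permutation for each epoch and concatenate them;
    # every document now appears exactly once before any document repeats.
    return np.concatenate(
        [rng.permutation(num_docs) for _ in range(num_epochs)])
```

The shuffle index, which is applied on top of the document index, would need the analogous per-epoch treatment.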

argitrage · Feb 20 '24 03:02

I ran some preliminary experiments on this:

  1. One can simulate (or directly measure) this shuffling, and a significant fraction (~30%) of the samples is never seen in a given epoch because the shuffle spans epochs (see the simulation sketch after this list).
  2. In small-scale training of a GPT-style model on Wikipedia, correcting this shuffling does not seem to yield any performance improvement. My guess is that LMs continue to improve even after repeating the input data for a few epochs, so this shuffling issue does not hurt them significantly.
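
Below is a small self-contained simulation of point 1 (illustrative sizes, not a measurement on Megatron itself): with 3 epochs of data shuffled globally, roughly (2/3)^3 ≈ 29.6% of documents never appear among the first epoch's worth of samples, consistent with the ~30% figure above.

```python
import numpy as np

num_docs, num_epochs = 100_000, 3   # illustrative sizes
rng = np.random.default_rng(0)

# Global shuffle: tile the documents across epochs, shuffle the whole block.
doc_idx = np.tile(np.arange(num_docs), num_epochs)
rng.shuffle(doc_idx)

# "Epoch 1" is simply the first num_docs samples the training loop consumes.
seen_in_first_epoch = np.unique(doc_idx[:num_docs]).size
missing_fraction = 1 - seen_in_first_epoch / num_docs
print(f"never seen in epoch 1: {missing_fraction:.1%}")  # prints ~29-30%
```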

akhilkedia · Apr 08 '24 12:04

Marking as stale. No activity in 60 days.

github-actions[bot] · Jun 07 '24 18:06