Megatron-LM
Incorrect shuffling of documents across epochs in GPTDataset
**Incorrect Dataset Shuffling**
- Currently, in `gpt_dataset.py`, the dataset is globally shuffled across epochs rather than shuffled within each epoch, which is the standard approach (see the sketch after this list).
- Both the shuffle-index code and the document-index code shuffle across epochs.
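For context, here is a minimal sketch of the global-shuffle pattern described above (not the actual Megatron-LM code; `build_doc_idx_global`, `num_docs`, and `num_epochs` are hypothetical names). The document ids are repeated once per epoch and the concatenation is shuffled as a whole, so samples from different epochs get interleaved:

```python
import numpy as np

def build_doc_idx_global(num_docs: int, num_epochs: int,
                         rng: np.random.RandomState) -> np.ndarray:
    # Hypothetical sketch of the reported behavior, not gpt_dataset.py code.
    # Repeat the document ids once per epoch...
    doc_idx = np.tile(np.arange(num_docs, dtype=np.int32), num_epochs)
    # ...then shuffle the *whole* concatenation. Epoch boundaries are
    # destroyed: a document can occur twice before another occurs once.
    rng.shuffle(doc_idx)
    return doc_idx
```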
**Question:** Has this been done on purpose? Is there any reason to prefer global shuffling over per-epoch shuffling?
**Solution:** Shuffle the data per epoch instead of shuffling the full dataset. The implementation is straightforward; however, both the document index and the shuffle index need to be fixed to resolve the overall problem.
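A minimal sketch of the proposed per-epoch fix (again with hypothetical names, not a patch against `gpt_dataset.py`): shuffle each epoch's ids independently and concatenate, so every document is seen exactly once before any document is repeated. The analogous change would apply to the shuffle index over samples.

```python
import numpy as np

def build_doc_idx_per_epoch(num_docs: int, num_epochs: int,
                            rng: np.random.RandomState) -> np.ndarray:
    # Shuffle each epoch independently, then concatenate: within any
    # epoch-sized slice, every document appears exactly once.
    epochs = []
    for _ in range(num_epochs):
        epoch_idx = np.arange(num_docs, dtype=np.int32)
        rng.shuffle(epoch_idx)
        epochs.append(epoch_idx)
    return np.concatenate(epochs)
```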
I ran some preliminary experiments on this:
- On simulating (or directly measuring) this shuffling, a significant fraction (~30%) of the samples are never seen in a given epoch because shuffling happens across epochs (see the simulation sketch after this list).
- On small-scale training of a GPT-style model on Wikipedia, correcting the shuffling does not seem to lead to any performance improvement. My guess is that LMs keep improving even after repeating the input data for a few epochs, so this shuffling issue does not hurt them significantly.
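A quick way to reproduce the first observation (a hypothetical simulation with made-up sizes, not the exact measurement above): globally shuffle `num_epochs` copies of `num_docs` document ids and count how many ids never appear in the first epoch-sized slice.

```python
import numpy as np

num_docs, num_epochs = 100_000, 5
rng = np.random.RandomState(1234)

# Global shuffle across epochs, mimicking the reported behavior.
doc_idx = np.tile(np.arange(num_docs), num_epochs)
rng.shuffle(doc_idx)

# Count documents that never show up in the first epoch-sized slice.
seen_in_first_epoch = np.unique(doc_idx[:num_docs])
missing_frac = 1.0 - seen_in_first_epoch.size / num_docs
print(f"never seen in epoch 1: {missing_frac:.1%}")
# The expected fraction is roughly ((E - 1) / E) ** E, i.e. ~33% for
# 5 epochs, consistent with the ~30% figure above (limit: 1/e ~ 37%).
```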
Marking as stale. No activity in 60 days.