
Has the data been shuffled?

Open Lisennlp opened this issue 1 year ago • 2 comments

Hello, I looked at your batch_view.py and found that it does not shuffle the data, whereas the gpt-neox library does. So I want to confirm: was the data shuffled during training or not? I hope to get your answer, thank you!

Lisennlp avatar Nov 02 '23 09:11 Lisennlp

I think this might provide an answer https://github.com/EleutherAI/pythia/issues/123#issuecomment-1878882214

pietrolesci avatar Jan 09 '24 19:01 pietrolesci

The data is shuffled at the document level. The repo ID given in https://github.com/EleutherAI/pythia#exploring-the-dataset says "preshuffled", i.e., EleutherAI/pile-standard-pythia-preshuffled.
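To illustrate what "preshuffled" would mean for a viewer script, here is a minimal Python sketch. It is not the Pythia or gpt-neox code: `preshuffle_documents`, `sequential_batches`, the seed, and the toy documents are all hypothetical names made up for this example. The idea is that if documents are shuffled once before being stored, a reader can walk the stored data in order without calling shuffle again and still see a shuffled stream.

```python
import random


def preshuffle_documents(documents, seed):
    """Return a copy of the documents in a seeded random order (hypothetical helper)."""
    shuffled = list(documents)
    random.Random(seed).shuffle(shuffled)
    return shuffled


def sequential_batches(documents, batch_size):
    """Yield consecutive batches in stored order, with no further shuffling."""
    for start in range(0, len(documents), batch_size):
        yield documents[start:start + batch_size]


# Toy stand-in for a corpus; real data would be tokenized documents.
docs = [f"doc-{i}" for i in range(8)]

# Shuffle once, up front, then store the result.
stored = preshuffle_documents(docs, seed=1234)

# A viewer that reads sequentially reproduces the stored (shuffled) order.
batches = list(sequential_batches(stored, batch_size=4))
```

Under this assumption, the absence of a shuffle call in a viewer would not by itself mean the training order was unshuffled; whether that holds for the actual Pythia files is what the linked comment addresses.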

I'm actually not sure about https://huggingface.co/datasets/EleutherAI/pythia_deduped_pile_idxmaps, which is mentioned in https://github.com/EleutherAI/pythia#reproducing-training. I will add a question about this on #123.

itsnamgyu avatar Jan 10 '24 04:01 itsnamgyu