
Source for downloading train.bin and val.bin

Open · TccccD opened this issue · 3 comments

Hello, I am training GPT-2 from scratch, but I found that processing the OpenWebText data is too slow, and our GPU server can't connect to the Internet. It has taken me 2 days and I feel like there's no other way. Could you publish a download link for train.bin and val.bin?

@karpathy @otaviogood

TccccD avatar Mar 12 '23 08:03 TccccD

Just prepare these files on another computer. It doesn't need a GPU, just lots of RAM and disk space. Here https://huggingface.co/datasets/openwebtext/blob/main/openwebtext.py you can find a link to https://zenodo.org/record/3834942/files/openwebtext.tar.xz, which is OpenWebText itself.
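For context on why the files are easy to move between machines: nanoGPT's `data/openwebtext/prepare.py` writes `train.bin` and `val.bin` as flat, headerless arrays of GPT-2 token ids in `uint16` (the GPT-2 vocab of 50257 fits below 2**16), which `train.py` then reads back with a memory map. A minimal sketch of that layout, using made-up token ids rather than real GPT-2 output:

```python
import numpy as np

# Sketch of the .bin layout nanoGPT produces: a flat array of token ids
# stored as uint16, with no header. These ids are placeholders, not the
# output of a real GPT-2 tokenizer.
tokens = np.array([15496, 11, 995, 50256], dtype=np.uint16)
tokens.tofile("tiny_train.bin")

# Reading back with a memory map means even a multi-GB train.bin never
# has to fit in RAM at once.
data = np.memmap("tiny_train.bin", dtype=np.uint16, mode="r")
print(len(data), data[:4].tolist())  # → 4 [15496, 11, 995, 50256]
```

So once the files are built on any CPU box with enough RAM and disk, they can simply be copied to the offline GPU server.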

Also, I think it's possible to process the dataset in parallel, maybe on multiple nodes. I'm working on my own prepare.py script for this. See my GitHub; I'll commit it soon.

ramiil avatar Mar 20 '23 20:03 ramiil

https://gist.github.com/ramiil/389faa6798df038d349212b19259f124 here is my prepare.py. It works with multiple processor cores and can load big datasets (over 10 GB), as long as each single file in the dataset is smaller than your RAM. It's rough code, so improve it if you can and want to.
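The multi-core idea above can be sketched in a few lines: split the corpus into contiguous shards, tokenize each shard in a separate process, then append the shards' token arrays to the output file in order. This is a minimal sketch, not ramiil's actual gist; the `encode` function here is a placeholder (raw UTF-8 bytes), where the real pipeline would use tiktoken's `gpt2` encoding.

```python
import numpy as np
from multiprocessing import Pool

def encode(text):
    # Placeholder tokenizer: raw UTF-8 byte values widened to uint16.
    # The real pipeline would use tiktoken's "gpt2" encoding here.
    return np.frombuffer(text.encode("utf-8"), dtype=np.uint8).astype(np.uint16)

def tokenize_shard(docs):
    # Tokenize one shard of documents into a single flat token array.
    return np.concatenate([encode(d) for d in docs])

def parallel_prepare(docs, out_path, workers=4):
    # Contiguous shards, one per worker, so document order is preserved
    # when the results are concatenated.
    step = -(-len(docs) // workers)  # ceiling division
    shards = [docs[i:i + step] for i in range(0, len(docs), step)]
    with Pool(len(shards)) as pool:
        results = pool.map(tokenize_shard, shards)
    with open(out_path, "wb") as f:
        for arr in results:
            f.write(arr.tobytes())

if __name__ == "__main__":
    parallel_prepare(["hello ", "world"], "train.bin", workers=2)
    data = np.memmap("train.bin", dtype=np.uint16, mode="r")
    print(len(data))  # → 11
```

The same sharding scheme extends to multiple nodes: each node processes its own shard range into a part file, and the part files are concatenated at the end.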

ramiil avatar Mar 22 '23 10:03 ramiil

I also don't understand the inefficiency of so many systems endlessly redoing the same work; what a waste of energy. I will share the model on HF once I have it. For now, unfortunately, I can't help you with a link.

dikkietrom avatar May 04 '23 08:05 dikkietrom