nanoGPT
Source for downloading train.bin and val.bin
Hello, I am training GPT-2 from scratch, but I found that processing the openwebtext data is too slow, and our GPU server can't connect to the Internet. It's taken me 2 days and I feel like there's no other way. Could you please make the train.bin and val.bin files available via a download link?
@karpathy @otaviogood
Just prepare these files on another computer. It doesn't need a GPU, just lots of RAM and disk space. Here https://huggingface.co/datasets/openwebtext/blob/main/openwebtext.py you can find the link to https://zenodo.org/record/3834942/files/openwebtext.tar.xz, which is OpenWebText itself.
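For reference, the .bin files are just a flat stream of uint16 GPT-2 token ids, so you can reproduce them offline once you have the raw text. Here is a minimal sketch (not the repo's prepare.py itself; the local directory layout and the split ratio are my assumptions):

```python
# Sketch: tokenize local text files into train.bin / val.bin as flat
# uint16 GPT-2 token ids, the format nanoGPT's train.py reads.
import glob
import numpy as np
import tiktoken

enc = tiktoken.get_encoding("gpt2")  # GPT-2 BPE; vocab fits in uint16

def tokenize_file(path):
    with open(path, "r", encoding="utf-8") as f:
        ids = enc.encode_ordinary(f.read())
    ids.append(enc.eot_token)  # delimit documents with <|endoftext|>
    return ids

all_ids = []
for path in sorted(glob.glob("openwebtext_extracted/*.txt")):  # assumed layout
    all_ids.extend(tokenize_file(path))

arr = np.array(all_ids, dtype=np.uint16)
split = int(len(arr) * 0.9995)  # keep a tiny validation slice
arr[:split].tofile("train.bin")
arr[split:].tofile("val.bin")
```

Then just copy the two .bin files over to the GPU server.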
Also, I think it's possible to process the dataset in parallel, maybe even on multiple nodes. I'm working on my own prepare.py script for this; see my GitHub, I'll commit it soon.
https://gist.github.com/ramiil/389faa6798df038d349212b19259f124 Here is my prepare.py. It works with multiple processor cores and can load big datasets (over 10 GB), as long as each single file in the dataset is smaller than your RAM. It's poor code, so enhance it if you can and want to. A rough sketch of the multi-core pattern is below.
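The basic idea is to tokenize shards in worker processes and concatenate the results in a fixed order. This is just an illustration of the pattern, not the gist's code; the shard paths are assumptions:

```python
# Sketch: tokenize text shards in parallel, one worker per CPU core.
import glob
from multiprocessing import Pool
import numpy as np
import tiktoken

def tokenize_shard(path):
    enc = tiktoken.get_encoding("gpt2")  # built per call; fine for a sketch
    with open(path, "r", encoding="utf-8") as f:
        ids = enc.encode_ordinary(f.read())
    ids.append(enc.eot_token)  # document separator
    return np.array(ids, dtype=np.uint16)

if __name__ == "__main__":
    shards = sorted(glob.glob("openwebtext_extracted/*.txt"))  # assumed layout
    with Pool() as pool:  # defaults to one worker per CPU core
        chunks = pool.map(tokenize_shard, shards)
    # concatenate in shard order so the token stream stays deterministic;
    # split off val.bin the same way as in the serial sketch above
    np.concatenate(chunks).tofile("train.bin")
```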
I also don't understand the inefficiency of so many systems endlessly doing the same work; what a waste of energy. I will upload the model to HF when I have it. For now I unfortunately can't help you with a link.