About building *.bin and *.idx

Open Yijia-Xiao opened this issue 3 years ago • 5 comments

Hi, thank you for your great work! I've been using Megatron-LM for some time, and I've run into a problem building a large dataset. I used preprocess_data.py to convert a JSONL file (about 1 TB) into *.bin and *.idx files on a server with 504 GB of memory. Unfortunately, when the *.bin file grows to about 600 GB, the process appears to be dead. Is there a solution for a corpus this large, or would the lazy loader work?

Thank you :)

Yijia-Xiao, Oct 29 '21 02:10