Megatron-LM
About building *.bin and *.idx
Hi, thank you for your great work!
I've been using Megatron-LM for some time, and I've run into a problem when building a large dataset.
I used preprocess_data.py to convert a JSONL file (about 1TB) into *.bin and *.idx files; the server has 504GB of memory.
Unfortunately, when the *.bin file grows to about 600GB, the process appears to be dead. Is there a recommended solution for a corpus this large, or would the lazy loader work?
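One possible workaround, sketched here under assumptions and not confirmed against Megatron-LM, is to split the JSONL into smaller shards and run preprocess_data.py on each shard separately, so that no single run has to produce a 600GB+ *.bin file. The function name `split_jsonl`, the shard naming scheme, and the shard size are hypothetical, and whether the resulting per-shard datasets can be combined for training would still need to be checked.

```python
import os


def split_jsonl(src_path, out_dir, lines_per_shard):
    """Stream src_path line by line and write shards of at most
    lines_per_shard lines each. Only one line is held in memory at a time.
    (Hypothetical helper, not part of Megatron-LM.)"""
    os.makedirs(out_dir, exist_ok=True)
    shard_paths = []
    out = None
    count = 0
    with open(src_path, "r", encoding="utf-8") as src:
        for line in src:
            # Start a new shard on the first line or when the current one is full.
            if out is None or count >= lines_per_shard:
                if out is not None:
                    out.close()
                path = os.path.join(out_dir, f"shard_{len(shard_paths):05d}.jsonl")
                out = open(path, "w", encoding="utf-8")
                shard_paths.append(path)
                count = 0
            out.write(line)
            count += 1
    if out is not None:
        out.close()
    return shard_paths
```

Each shard path returned by the function could then be passed to a separate preprocessing run, which keeps every output file well below the size at which the process stopped.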
Thank you:)