RedPajama-Data
Expected finish time for processing a single Common Crawl index?
One more question, please.
Using the provided command, how long does each step (e.g., quality filtering, deduplication, quality classifier) take when processing a single Common Crawl index (e.g., 2023-06)?
Thank you!
We used a machine with 64 cores and 512 GB RAM, and it took about 2-3 days to process one CC dump with the cc_net pipeline. You can expect another day for deduplication and applying the quality classifier.
You can use the quality classifier that we have trained, so that you don't have to retrain it (this part of the README points you to the model).
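For rough planning, the numbers above can be turned into a back-of-the-envelope estimate. This is only a sketch: the function names `estimate_days` and `core_hours` are made up for illustration, and it assumes dumps are processed one after another on a single 64-core machine, with time scaling linearly in the number of dumps. Real runs will vary with dump size, disk and network speed.

```python
# Rough estimate based on the figures quoted in this thread:
# 2-3 days for cc_net per dump, plus about 1 day for
# deduplication and the quality classifier.
# Assumption: dumps are processed sequentially on one machine.

def estimate_days(num_dumps, cc_net_days=(2, 3), post_days=1):
    """Return a (low, high) range of wall-clock days for num_dumps dumps."""
    low = num_dumps * (cc_net_days[0] + post_days)
    high = num_dumps * (cc_net_days[1] + post_days)
    return low, high


def core_hours(days, cores=64):
    """Convert wall-clock days on a machine with `cores` cores to core-hours."""
    return days * 24 * cores


if __name__ == "__main__":
    low, high = estimate_days(1)
    print(f"one dump: {low}-{high} days")
    print(f"core-hours: {core_hours(low)}-{core_hours(high)}")
```

Under these assumptions, one dump comes out to roughly 3-4 days end to end on such a machine.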
Hello, how much disk space is needed? About 100 TB?
@mauriceweber Was that a single machine with 64 cores and 512 GB of RAM, for a single index?