
Expected finish time for processing one single index of commoncrawl?

Open kimcando opened this issue 1 year ago • 3 comments

One more question, please.

Using the provided commands, how long does each step (e.g., quality filtering, deduplication, quality classifier) take when processing a single Common Crawl index (e.g., 2023-06)?

Thank you!

kimcando avatar May 01 '23 08:05 kimcando

We used a machine with 64 cores and 512 GB of RAM, and it took about 2-3 days to process one CC dump with the cc_net pipeline. You can expect another day for deduplication and applying the quality classifier.

You can use the quality classifier that we have trained, so you don't have to retrain it (this part of the readme points you to the model).
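
For planning across several dumps, the figures above can be turned into a rough back-of-the-envelope estimate. The sketch below is only illustrative: the function name `estimate_days` and its parameters are hypothetical, and it assumes each dump is processed independently on one machine comparable to the 64-core, 512 GB one described, with 2-3 days for cc_net plus about one day for deduplication and the quality classifier.

```python
import math


def estimate_days(num_dumps, num_machines=1, cc_net_days=(2.0, 3.0), post_days=1.0):
    """Rough (low, high) wall-clock estimate in days.

    Assumptions (not confirmed by the maintainers):
    - one dump occupies one machine at a time,
    - dumps do not share work, so time scales with the number of rounds,
    - post_days covers deduplication plus the quality classifier.
    """
    if num_dumps < 1 or num_machines < 1:
        raise ValueError("num_dumps and num_machines must be at least 1")
    # Number of sequential batches needed when machines work in parallel.
    rounds = math.ceil(num_dumps / num_machines)
    low = rounds * (cc_net_days[0] + post_days)
    high = rounds * (cc_net_days[1] + post_days)
    return low, high


# One dump on one machine: roughly 3 to 4 days in total.
print(estimate_days(1))
```

Actual times will depend on the hardware, the dump size, and I/O throughput, so treat the result as an order-of-magnitude guide only.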

mauriceweber avatar May 02 '23 13:05 mauriceweber

Hello, how much disk space will be needed? About 100 TB?

newbietuan avatar May 19 '23 06:05 newbietuan

@mauriceweber Was that a single machine with 64 cores and 512 GB of RAM, for a single index?

kimcando avatar May 29 '23 17:05 kimcando