RedPajama-Data

Running full pipeline on a small part of CC

Open zhentingqi opened this issue 2 years ago • 0 comments

Hi! Can anyone please tell me how to run the full mining pipeline using cc_net on just a very small portion of CC? E.g., I just want around 100M of cleaned data from the newest crawl, 2023-50. Thanks!

zhentingqi, Feb 05 '24 21:02