datatrove

JSONL loading slow when using megawarcs

Open • cryptowooser opened this issue 8 months ago • 1 comment

I'm trying to run process_common_crawl_dump.py to dedupe an 80 GB megawarc, and the JSONL loader is taking a long time to load the data. It appears to run single-threaded even when I set the number of workers higher. What are the best practices for working with megawarcs? Should I extract the files before running datatrove?
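To make the "extract in advance" option concrete, here is a minimal sketch of splitting one large JSONL file into several smaller shards before the pipeline runs. It rests on an assumption that is not confirmed in this thread: that the loader hands out work per file, so a single huge file ends up on a single worker. The function name `split_jsonl`, the shard naming scheme, and the paths are hypothetical, and the code uses only the Python standard library, not any datatrove API.

```python
import os


def split_jsonl(src_path, out_dir, num_shards):
    """Split one JSONL file into `num_shards` files, round-robin by line.

    Hypothetical helper for illustration; streams the input so memory use
    stays small even for very large files.
    """
    os.makedirs(out_dir, exist_ok=True)
    paths = [
        os.path.join(out_dir, f"shard_{i:05d}.jsonl") for i in range(num_shards)
    ]
    outs = [open(p, "w", encoding="utf-8") for p in paths]
    try:
        with open(src_path, "r", encoding="utf-8") as src:
            for line_no, line in enumerate(src):
                if not line.strip():
                    continue  # skip blank lines
                if not line.endswith("\n"):
                    line += "\n"  # last line may lack a trailing newline
                outs[line_no % num_shards].write(line)
    finally:
        for f in outs:
            f.close()
    return paths
```

If the per-file assumption holds, pointing the reader at the shard directory and using a task count no larger than the number of shards should let the workers run in parallel. Whether that is the recommended practice for megawarcs is exactly the open question here.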

cryptowooser • Jun 03 '24 03:06