datatrove
JSONL loading slow when using megawarcs
I'm trying to run process_common_crawl_dump.py to dedupe an 80GB megawarc, and the JSONL loader is taking a long time to load the data. Loading appears to be single-threaded even when I set the number of workers higher. What are the best practices for working with megawarcs? Should I extract the files before running datatrove?
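To make the "extract in advance" option concrete, here is a minimal sketch of pre-splitting one large JSONL file into several smaller shards using only the Python standard library. It assumes the slowdown comes from a single input file being read by one worker, which is a guess and not something confirmed here. The function name `split_jsonl`, the shard file naming, and the shard count are all hypothetical and not part of datatrove.

```python
import os


def split_jsonl(src_path, out_dir, num_shards):
    """Distribute the lines of one JSONL file round-robin into num_shards files.

    Hypothetical helper for illustration; not a datatrove API.
    """
    os.makedirs(out_dir, exist_ok=True)
    shard_paths = [
        os.path.join(out_dir, f"shard_{i:05d}.jsonl") for i in range(num_shards)
    ]
    outputs = [open(p, "w", encoding="utf-8") for p in shard_paths]
    try:
        written = 0
        with open(src_path, "r", encoding="utf-8") as src:
            # Stream line by line so the whole file never sits in memory.
            for line in src:
                if not line.strip():
                    continue  # skip blank lines
                if not line.endswith("\n"):
                    line += "\n"  # the last record may lack a newline
                outputs[written % num_shards].write(line)
                written += 1
    finally:
        for f in outputs:
            f.close()
    return shard_paths
```

If the reader assigns whole files to workers, pointing it at the shard directory instead of the single file would give each worker something to read. Whether that is the recommended approach for megawarcs is exactly the open question.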