
Add example of how to resume an interrupted `download_common_crawl` job

Open sarahyurick opened this issue 6 months ago • 0 comments

Since `download_common_crawl` can be a very large job (roughly 100,000 WARC files per full snapshot), a user's job may be interrupted, stopped, or cancelled for various reasons. By that point, thousands or tens of thousands of files may already have been downloaded, so we do not want to start the entire process over again.

Luckily, there are several ways to resume the job without redoing the files that were already extracted. We should add an example script and/or tutorial for how to do this.
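As a starting point for such an example, here is a minimal sketch of one possible approach: compare the full list of WARC URLs against the output directory and keep only the URLs whose output file is missing. It deliberately does not call NeMo-Curator. The helper names `expected_output_path` and `remaining_urls` are hypothetical, and the assumption that each WARC URL produces one output file named after the last URL component plus `.jsonl` should be checked against the installed version before relying on it.

```python
import os


def expected_output_path(url, output_dir, suffix=".jsonl"):
    """Map a WARC URL to the output file it is assumed to produce.

    ASSUMPTION: one output file per URL, named <last URL component><suffix>.
    """
    file_name = url.rstrip("/").split("/")[-1] + suffix
    return os.path.join(output_dir, file_name)


def remaining_urls(urls, output_dir, suffix=".jsonl"):
    """Return the URLs whose expected output file is missing or empty."""
    pending = []
    for url in urls:
        path = expected_output_path(url, output_dir, suffix)
        # A non-empty file is treated as finished. This is only a heuristic:
        # a file that was being written when the job stopped is also non-empty.
        if os.path.isfile(path) and os.path.getsize(path) > 0:
            continue
        pending.append(url)
    return pending
```

The returned list could then be passed to whatever step performs the download and extraction. Because a file that was mid-write at the time of the interruption would pass the size check, it may be safer to delete the most recently modified output files before resuming.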

Alternatively, we could enable this on the `download_common_crawl` module side, but this approach might be more involved and/or create issues depending on the user's setup. For now, I think an example for users to reference should be sufficient.

sarahyurick, May 16 '25 18:05