NeMo-Curator
Add example of how to resume an interrupted `download_common_crawl` job
Since `download_common_crawl` can be quite a large job (~100,000 WARC files per full snapshot), a user's job may be interrupted, stopped, or cancelled for various reasons. By that point, thousands or tens of thousands of files may already have been downloaded, so we do not want to start the entire process over again.
Luckily, there are several ways to resume the job without redoing the files that were already extracted. We should add an example script and/or tutorial for how to do this.
Alternatively, we could enable this inside the `download_common_crawl` module itself, but that approach might be more involved and/or create issues depending on the user's setup. For now, I think an example for users to reference should be sufficient.
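
As a starting point for such an example, here is a minimal sketch of one possible resume strategy: compare the list of WARC URLs against the files already present in the output directory and keep only the URLs that still need processing. It uses only the Python standard library and does not call any NeMo-Curator API. The helper names `output_name_for` and `remaining_urls`, the `.jsonl` suffix, and the URL-to-filename mapping are all assumptions made for illustration; a real script would need to match whatever naming scheme the download step actually uses.

```python
from pathlib import Path


def output_name_for(url: str, suffix: str = ".jsonl") -> str:
    """Hypothetical mapping from a WARC URL to its output file name.

    Assumption: the name is the URL without its scheme, with "/" replaced
    by "-", plus a suffix. Adjust this to the naming scheme actually used.
    """
    return url.split("://", 1)[-1].replace("/", "-") + suffix


def remaining_urls(urls, output_dir, suffix: str = ".jsonl"):
    """Return the URLs whose output file is missing or empty in output_dir."""
    out = Path(output_dir)
    if not out.is_dir():
        # Nothing has been written yet, so every URL is still pending.
        return list(urls)
    # Treat zero-byte files as unfinished so they are processed again.
    done = {p.name for p in out.glob(f"*{suffix}") if p.stat().st_size > 0}
    return [u for u in urls if output_name_for(u, suffix) not in done]
```

The filtered list would then be passed to the download step in place of the full URL list. Note that a non-empty file left behind by an interrupted write would be counted as finished by this check, so partially written outputs should be deleted before resuming (or the job should write to a temporary name and rename on completion).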