NeMo-Curator
Resuming a job on Slurm after it gets cancelled
Initial discussion happened with @VibhuJawa
Is your feature request related to a problem? Please describe.
When running a workflow on Slurm with large files, the job sometimes has to be cancelled or stopped (for example, because of a 4-hour Slurm job time limit), which leaves partial results. Can we have a robust resume feature that works even when some files are only partially processed?
Describe the solution you'd like
Merge the output file parts (0.part, 1.part, ...) into one file and compare it with the input file to check whether processing completed. If it did not, the remaining part of that file is included in the set that goes for processing when the job resumes.
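
To make the proposal concrete, here is a rough sketch of the completeness check in plain Python. It is not NeMo-Curator code. It assumes a hypothetical layout in which each input file `foo.jsonl` has an output directory `<output_root>/foo/` containing `0.part`, `1.part`, ..., that records are one per line, and that processing emits exactly one output record per input record (which would not hold for filtering steps). The function names are made up for illustration. Instead of physically merging the parts, it sums their line counts, and for simplicity it marks the whole file for reprocessing rather than only the remaining part.

```python
from __future__ import annotations

from pathlib import Path


def count_lines(path: Path) -> int:
    """Count lines in a file (assumption: one record per line)."""
    with path.open("rb") as f:
        return sum(1 for _ in f)


def part_files(out_dir: Path) -> list[Path]:
    """Return 0.part, 1.part, ... ordered by their numeric index."""
    parts = [p for p in out_dir.glob("*.part") if p.stem.isdigit()]
    return sorted(parts, key=lambda p: int(p.stem))


def processed_line_count(out_dir: Path) -> int:
    """Total lines across all parts; 0 if the output directory is missing."""
    if not out_dir.is_dir():
        return 0
    return sum(count_lines(p) for p in part_files(out_dir))


def files_to_resume(input_files, output_root) -> list[Path]:
    """Return the input files whose outputs are missing or incomplete.

    Hypothetical layout: outputs for `foo.jsonl` live in `<output_root>/foo/`.
    """
    pending = []
    for src in map(Path, input_files):
        out_dir = Path(output_root) / src.stem
        if processed_line_count(out_dir) != count_lines(src):
            pending.append(src)
    return pending
```

On resume, the job would pass only the result of `files_to_resume(...)` to the workflow. A line-count comparison is cheap but coarse: it would not notice a final line that was cut off mid-write, so a stricter check (for example, comparing record IDs) may be needed in practice.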