Resuming a job on Slurm after it gets cancelled

Open uahmed93 opened this issue 1 year ago • 0 comments
The initial discussion happened with @VibhuJawa.

Is your feature request related to a problem? Please describe.
When a workflow running on Slurm over large files has to be cancelled or stopped, for example because of the 4-hour Slurm job limit, we are left with partial results. Can we have a robust resume feature that also handles files that were only partially processed?

Describe the solution you'd like
Merge the output file parts (0.part, 1.part, ...) into one file and compare it with the input file to check whether processing completed. If it did not, then when the job resumes, the remaining part of that file should also be added to the set that goes for processing.
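
A rough sketch of the check described above, in plain Python rather than any existing NeMo-Curator API. It rests on several assumptions that are not confirmed anywhere in this issue: inputs are line-delimited (one record per line, e.g. JSONL), the parts for an input file are written to a directory named after that file's stem, and the processing step emits exactly one output line per input line (a filtering step would need a different completeness signal). The function names `count_lines`, `is_fully_processed` and `files_to_resume` are hypothetical.

```python
from pathlib import Path


def count_lines(path: Path) -> int:
    """Count newline-delimited records without loading the file into memory."""
    with path.open("rb") as f:
        return sum(1 for _ in f)


def is_fully_processed(input_file: Path, output_dir: Path) -> bool:
    """Compare the total record count of all *.part files with the input file.

    Assumes one output record per input record (hypothetical layout).
    """
    if not output_dir.is_dir():
        return False
    parts = list(output_dir.glob("*.part"))
    if not parts:
        return False
    return sum(count_lines(p) for p in parts) == count_lines(input_file)


def files_to_resume(input_files, output_root: Path):
    """Return the input files whose outputs are missing or incomplete.

    Assumes parts for `data.jsonl` live under `output_root / "data"`.
    """
    return [
        f for f in input_files
        if not is_fully_processed(f, output_root / f.stem)
    ]
```

On resume, only the files returned by `files_to_resume` would be queued again. Skipping the records already written inside a partially processed file would additionally require the parts to preserve input order, which this sketch does not assume.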

uahmed93, Oct 11 '24 17:10