
Preemptable marian training on slurm.

Open ugermann opened this issue 2 years ago • 3 comments

These changes allow marian training jobs on slurm to be interrupted without losing training progress. The script requests an early warning from slurm a set amount of time (currently 300 sec.) before the end of its time slot, runs marian in the background, and shuts it down gracefully with SIGTERM so that marian can save its training state. It then examines marian's exit status and decides whether to cancel the current slurm job array, to avoid wasting subsequent slurm scheduling slots.
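A minimal sketch of the pattern described above, not the script from this change. It assumes the early warning is requested with sbatch's `--signal` option and arrives as SIGUSR1, and that marian exits with 143 (128 + SIGTERM) after saving its state, as the testing tip further down the thread implies. `run_preemptible` and `should_cancel_array` are hypothetical helper names.

```shell
#!/bin/bash
#SBATCH --signal=B:USR1@300   # assumed: slurm sends SIGUSR1 to the batch shell 300 s before the time limit

# Run a command in the background, forward an early warning (SIGUSR1)
# to it as SIGTERM, and return the command's own exit status.
run_preemptible() {
    "$@" &
    local pid=$!
    local status=0
    # On the early warning, ask the child to shut down and save its state.
    trap "kill -TERM $pid 2>/dev/null || true" USR1
    while :; do
        status=0
        # A trapped signal makes `wait` return early with a status > 128,
        # so keep waiting until the child has really exited.
        wait "$pid" || status=$?
        kill -0 "$pid" 2>/dev/null || break
    done
    trap - USR1
    return "$status"
}

# Assumption: 143 means "interrupted, state saved", so the next array task
# should resume training; any other status means finished or failed.
should_cancel_array() {
    [ "$1" -ne 143 ]
}

# Hypothetical use inside the job script:
#   status=0
#   run_preemptible marian "$@" || status=$?
#   if should_cancel_array "$status"; then scancel "$SLURM_ARRAY_JOB_ID"; fi
#   exit "$status"
```

Running marian in the background matters here: a shell only runs its trap handlers between foreground commands, whereas the `wait` builtin is interrupted as soon as a trapped signal arrives.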

Needs testing.

ugermann, Mar 28 '22 17:03

One open question here is what Snakemake does with non-0 exit codes of scripts. I am completely new to Snakemake ...

ugermann, Mar 28 '22 17:03

@eu9ene For testing on slurm, you could just replace the marian job with tail -f /dev/null &. That will also return 143 when killed with SIGTERM.
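The stand-in can be checked locally without slurm. The snippet below is only an illustration of where the 143 comes from (128 plus 15, the signal number of SIGTERM); the variable names are arbitrary.

```shell
# Start the placeholder job in the background, as suggested above.
tail -f /dev/null &
pid=$!

# Terminate it the way the wrapper script would.
kill -TERM "$pid"

# Collect its exit status; `|| status=$?` keeps this safe under `set -e`.
status=0
wait "$pid" || status=$?
echo "exit status: $status"
```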

ugermann, Mar 28 '22 18:03

One open question here is what Snakemake does with non-0 exit codes of scripts. I am completely new to Snakemake ...

Snakemake deletes the job output that is specified as "output" in the rule. It is a safety mechanism to prevent corrupted data.

eu9ene, Mar 28 '22 23:03

I believe this is meant to solve specific issues on HPC that we're not using at the moment. Closing.

eu9ene, Sep 20 '23 00:09