
Best practices when rerunning after job fails

aimirza opened this issue 5 years ago • 5 comments

I am loving MetaErg. It makes annotations much easier. Thank you!

Many of my jobs had to be canceled while running, unfortunately. Nothing to do with MetaErg.

I know you mentioned that we can rerun with the same parameters and it will continue where it left off, thanks to the tmp folder. However, it seems we have to add the --force parameter for this to work. Is there anything else we should consider? Should we delete the last files if they were not finished writing, or do all of the scripts rewrite the folder rather than append to it? Please share the best practices for rerunning after a failed job. Thank you!

aimirza avatar Feb 21 '20 19:02 aimirza

When using the --force option, most scripts look for an existing output file and use it instead of running the tool again. This can lead to problems with tools like NHMMER, which write partial result files. One of my SGE jobs was killed during an unplanned reboot of its node, and that left an invalid, incomplete output file behind. If you have kept the output log, you might be able to find the files the workflow was working on at the time and just replace them, but the safer option would be to rerun the workflow.
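For HMMER-family tools specifically, a heuristic check may help: their tabular outputs (e.g. from `nhmmer --tblout`) normally end with a `# [ok]` marker line, so a file truncated by a killed job can often be spotted from its last line. A minimal sketch (the function name and file path are mine, not part of MetaErg):

```shell
# Succeed only if the HMMER-style table appears complete,
# i.e. its last line is the "# [ok]" end-of-output marker.
is_complete() {
  tail -n 1 "$1" | grep -q '^# \[ok\]'
}

# Usage idea: delete suspect files so a --force rerun regenerates them.
# is_complete results/nhmmer.tbl || rm results/nhmmer.tbl
```

This is only a heuristic for tools that emit such a terminator; it will not catch files that were corrupted mid-stream but happen to end cleanly.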

Finesim97 avatar Mar 04 '20 13:03 Finesim97

Could you please list all of the tools in the workflow that write partial results? That would help me decide whether or not to delete an output file. I would very much rather not restart the workflow. Hopefully this issue can be resolved in the next version of MetaErg?

aimirza avatar Mar 04 '20 15:03 aimirza

Sadly, I am not a developer of MetaErg and I don't know what the future of this project is. The caching behavior can even differ between tool versions and is mostly undocumented; it usually just depends on the I/O library the tool uses. Snakemake deals with this problem with flag files that are written when a job finishes or fails, using && and ||. The same could be added to MetaErg's command calls, and the check for the output file could be replaced with a check for the success flag. I will write a patch/PR, because as I am writing this, I realize how much I want this for myself as well.
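The flag-file idea above can be sketched in plain shell (the wrapper name and file names are placeholders, not MetaErg's actual commands):

```shell
#!/bin/sh
# Run a command, redirect its stdout to $out, and record the outcome in a
# flag file: $out.done on success, $out.failed otherwise. A resumed run
# then checks the flag instead of the (possibly partial) output file.
run_with_flag() {
  out="$1"; shift
  rm -f "$out.done" "$out.failed"
  if "$@" > "$out"; then
    touch "$out.done"
  else
    touch "$out.failed"
  fi
}

# On a rerun, trust the cached file only when its success flag exists:
if [ -e result.txt.done ]; then
  echo "reusing cached result.txt"
else
  run_with_flag result.txt echo "expensive computation"
fi
```

An explicit if/else is used here instead of a bare `cmd && touch done || touch failed` chain, so that a failure of `touch` itself cannot mislabel the run as failed.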

Finesim97 avatar Mar 05 '20 10:03 Finesim97

Snakemake is a great option. A recent metagenomics pipeline uses Snakemake and it works wonders for me. https://sunbeam.readthedocs.io/en/latest/usage.html

aimirza avatar Mar 07 '20 16:03 aimirza

Also, Snakemake makes it easier for me to modify parameters and select which programs to run.

aimirza avatar Mar 07 '20 16:03 aimirza