Check GTDB database integrity before proceeding with GTDB-Tk classify workflow

Open · mshamash opened this issue 4 years ago • 1 comment

Is your feature request related to a problem? Please describe

I was recently running the nf-core/mag pipeline on a batch of samples, and the GTDB-Tk classify workflow step kept failing with segfaults. After much digging, it turned out that the GTDB database extraction had not completed as expected, and some files were 0 bytes in size. This caused the FastANI step of GTDB-Tk to segfault. I ran gtdbtk check_install on the extracted database and it failed with a hash mismatch error. I re-extracted the database and

Describe the solution you'd like

gtdbtk check_install should be run after the GTDB database has been extracted; if it fails, the database should be re-extracted. The pipeline should continue with the GTDB-Tk workflows only if this check passes.
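
A minimal sketch of this check-then-re-extract logic, in Python rather than as an actual pipeline module. The function names (find_empty_files, check_install_passes, extract_and_verify) are hypothetical and not part of nf-core/mag or GTDB-Tk. It assumes that gtdbtk check_install reads the database location from the GTDBTK_DATA_PATH environment variable and exits with a non-zero status on failure, and that a healthy database contains no legitimately empty files, so the zero-byte scan is only a cheap heuristic before the real check. It also ignores any top-level release directory inside the archive.

```python
import os
import shutil
import subprocess
import tarfile
from pathlib import Path


def find_empty_files(db_dir):
    """Return the sorted paths of all zero-byte regular files under db_dir."""
    return sorted(
        p for p in Path(db_dir).rglob("*") if p.is_file() and p.stat().st_size == 0
    )


def check_install_passes(db_dir):
    """Run `gtdbtk check_install` against db_dir.

    Assumption: a failed check is reported through a non-zero exit status.
    """
    env = dict(os.environ, GTDBTK_DATA_PATH=str(db_dir))
    result = subprocess.run(["gtdbtk", "check_install"], env=env)
    return result.returncode == 0


def extract_and_verify(archive, db_dir, check=check_install_passes, max_attempts=2):
    """Extract archive into db_dir and verify it, re-extracting on failure.

    Returns the number of the attempt that passed; raises RuntimeError if
    every attempt fails, so downstream GTDB-Tk steps never start.
    """
    db_dir = Path(db_dir)
    for attempt in range(1, max_attempts + 1):
        # Start from a clean directory so a partial extraction cannot linger.
        shutil.rmtree(db_dir, ignore_errors=True)
        db_dir.mkdir(parents=True)
        with tarfile.open(archive) as tar:
            tar.extractall(db_dir)
        # Cheap heuristic first, then the (slower) hash-based check.
        if not find_empty_files(db_dir) and check(db_dir):
            return attempt
    raise RuntimeError(
        f"GTDB database in {db_dir} failed verification after {max_attempts} attempts"
    )
```

The check argument is injectable so that the retry logic can be exercised without GTDB-Tk installed; in real use the default would call gtdbtk check_install.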

Additional context

Adding a gtdbtk check_install step after the DB has been extracted, and before any GTDB-Tk usage, can save a lot of potentially wasted compute and queue time if the DB extraction didn't complete fully.

mshamash · Sep 16 '21 00:09

Thanks for bringing this up! Sounds like a good idea.

d4straub · Sep 16 '21 06:09