QUAST_BINS can be very slow / Reconfigure BIN_SUMMARY table

Open prototaxites opened this issue 9 months ago • 1 comments

Description of the bug

Creating issue as requested. This was over a year ago now but I had trouble with the QUAST_BINS process failing when running on very large numbers of bins.

My thought at the time was that this has something to do with the fact that it is implemented as a loop over each input FASTA file: https://github.com/nf-core/mag/blob/master/modules/local/quast_bins.nf, and that it's doing some kind of ORF calling for each bin.

edit: MetaQUAST calls genes using MetaGeneMark according to the paper: https://academic.oup.com/bioinformatics/article/32/7/1088/1743987

"In contrast with regular QUAST which uses GeneMarkS, MetaQUAST uses MetaGeneMark ([Zhu et al., 2010]) for gene prediction, which is developed specially for metagenomes."

edit 2:

The local module sets --rna-finding --gene-finding, so gene finding is turned on (it is disabled by default). A quick fix (ignoring the SeqKit stuff) is to just remove these arguments; the pipeline already profiles genes using Prodigal and Prokka anyway.

Command used and terminal output


Relevant files

No response

System information

No response

prototaxites, Apr 09 '25 12:04

Speaking with @prototaxites we discussed a bit more about this.

My general concern was whether we would lose columns that are in QUAST but not SEQKIT. There is a single one (# predicted rRNA genes) that we could not retrieve from SEQKIT.

Jim pointed out, as above, that we already get annotation info from PROKKA, and it is probably better (this info is in the PROKKA .txt file).

So I suggest our aim should be that the bin_summary.tsv table contains all the information (and optionally extra) needed to evaluate MAG QC metrics following the MIMAG specification.

One of the things missing from the current bin_summary.tsv is information about the presence of rRNAs/tRNAs. We could add this to the table from PROKKA.
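As a rough sketch of what pulling these counts out of PROKKA could look like, assuming the .txt summary consists of simple "key: value" lines (e.g. "rRNA: 3") and that a feature type missing from the file means none were found. The function name and the example content are hypothetical, not taken from the pipeline:

```python
def parse_prokka_txt(text):
    """Parse 'key: value' lines from a Prokka-style summary into a dict."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # skip lines without a colon
        value = value.strip()
        # Keep integer counts as ints, leave everything else as strings
        stats[key.strip()] = int(value) if value.isdigit() else value
    return stats


# Made-up example content, for illustration only
example = """organism: Genus species strain
contigs: 12
bases: 2500000
CDS: 2300
rRNA: 3
tRNA: 41
"""

stats = parse_prokka_txt(example)
# Assumption: feature types absent from the file are treated as 0
rna_counts = {key: stats.get(key, 0) for key in ("rRNA", "tRNA")}
print(rna_counts)
```

The resulting per-bin counts could then be joined onto the summary table by bin name.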

So to summarise, initially, we could:

  • Replace: N50/length stats from QUAST with SEQKIT table
  • Add: extra annotation stats from PROKKA .txt file
  • <and any extra metrics we are still missing>
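For reference, the N50/length stats in the first bullet only need contig lengths, so no gene prediction is involved. A minimal sketch of the usual N50 definition (this is an illustration, not SeqKit's or QUAST's actual implementation, and the function name is made up):

```python
def n50(lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list).

    N50 is the length of the contig at which the running sum of
    lengths, taken from longest to shortest, first reaches at least
    half of the total assembly length.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


# Total is 54, so the half-way point (27) is reached at 10 + 9 + 8
print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))  # prints 8
```

Values may still differ slightly between tools if they apply different minimum contig length filters before computing the stats.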

jfy133, Apr 11 '25 12:04