QUAST_BINS can be very slow / Reconfigure BIN_SUMMARY table
Description of the bug
Creating this issue as requested. This was over a year ago now, but I had trouble with the QUAST_BINS process failing when run on very large numbers of bins.
My thought at the time was that this has something to do with the fact that it is implemented as a loop over each input FASTA file: https://github.com/nf-core/mag/blob/master/modules/local/quast_bins.nf, and that it's doing some kind of ORF calling for each bin.
edit: MetaQUAST calls genes using MetaGeneMark according to the paper: https://academic.oup.com/bioinformatics/article/32/7/1088/1743987
"In contrast with regular QUAST which uses GeneMarkS, MetaQUAST uses MetaGeneMark ([Zhu et al., 2010]) for gene prediction, which is developed specially for metagenomes."
edit 2:
The local module sets --rna-finding --gene-finding, so gene finding is turned on (QUAST disables it by default). A quick fix (ignoring SeqKit stuff) is to just remove these arguments, since the pipeline profiles genes using Prodigal and Prokka anyway.
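As a rough illustration of that quick fix, here is a minimal Python sketch that builds a per-bin MetaQUAST argument list with gene and rRNA finding made opt-in. The helper name `metaquast_cmd`, the `--threads`/`-o` layout and the output directory naming are assumptions for illustration and are not copied from the module. Only `--rna-finding` and `--gene-finding` come from the module as described above.

```python
from pathlib import Path


def metaquast_cmd(bin_fasta, threads=1, gene_finding=False):
    """Build a metaquast.py argument list for one bin (hypothetical helper)."""
    bin_path = Path(bin_fasta)
    cmd = [
        "metaquast.py",
        "--threads", str(threads),
        "-o", f"QUAST/{bin_path.stem}",  # assumed output layout
    ]
    if gene_finding:
        # The two flags the local module currently sets; they switch on
        # per-bin gene and rRNA prediction, the suspected slow step.
        cmd += ["--rna-finding", "--gene-finding"]
    cmd.append(str(bin_path))
    return cmd
```

With `gene_finding=False` the call only computes assembly-level statistics, which is all the summary table takes from QUAST today.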
Command used and terminal output
Relevant files
No response
System information
No response
Speaking with @prototaxites, we discussed this a bit more.
My general concern was whether we would lose columns that are in QUAST but not in SEQKIT. There is a single one (# predicted rRNA genes) that we could not retrieve from SEQKIT.
Jim pointed out, as above, that we already get annotation info from PROKKA, and it is probably better (this info is in the prokka
So I suggest our aim should be that the bin_summary.tsv table contains all the information (and optionally extra) needed to evaluate MAG QC metrics following the MIMAG specification.
One of the things missing from the current bin_summary.tsv is information about the presence of rRNAs/tRNAs. We could add this to the table from PROKKA.
So to summarise, initially, we could:
- Replace: N50/length stats from QUAST with SEQKIT table
- Add: extra annotation stats from PROKKA .txt file
- <and any extra metrics we are still missing>
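The plan above could be sketched roughly as follows in Python. This is only an illustration under stated assumptions: it assumes the Prokka .txt summary consists of `key: value` lines (e.g. `rRNA: 1`), that the SeqKit table is tab-separated with a `file` column identifying each bin, and that `CDS`, `rRNA` and `tRNA` are the counts we want. The function names are hypothetical and not part of the pipeline.

```python
import csv
import io

# Prokka feature counts to carry into the summary (assumed set).
PROKKA_KEYS = ("CDS", "rRNA", "tRNA")


def parse_prokka_txt(text):
    """Parse 'key: value' lines with integer values from a Prokka .txt summary."""
    counts = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        # Skip non-numeric entries such as the organism line.
        if sep and value.strip().isdigit():
            counts[key.strip()] = int(value.strip())
    return counts


def merge_bin_stats(seqkit_tsv, prokka_by_bin):
    """Join a tab-separated SeqKit stats table with Prokka counts per bin."""
    rows = []
    for row in csv.DictReader(io.StringIO(seqkit_tsv), delimiter="\t"):
        counts = prokka_by_bin.get(row["file"], {})
        for key in PROKKA_KEYS:
            # Missing bins or features default to 0 in this sketch;
            # a real implementation may prefer NA instead.
            row[key] = counts.get(key, 0)
        rows.append(row)
    return rows
```

Each returned row keeps the SeqKit columns as strings and adds one integer column per Prokka feature, so the rRNA/tRNA information lands in the same table as the N50/length stats.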