Compress FASTA files with contigs and scaffolds using bgzip instead of gzip, to enable file indexing
Description of feature
Certain downstream analysis tools expect the FASTA files with contigs and scaffolds to be compressed with bgzip instead of gzip, so that the files can be indexed. To avoid an extra decompress-and-recompress step, it would be helpful if the MEGAHIT and SPAdes modules compressed their outputs directly with bgzip instead of gzip.
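For anyone who wants to check which kind of compression a given contigs file has, here is a minimal Python sketch. It relies on the BGZF layout: each block is a gzip member with the extra-field flag set and a "BC" subfield of length 2 in the header. The function name is_bgzf is hypothetical (not part of the pipeline or any library), and the check assumes the "BC" subfield comes first in the extra field, which is how bgzip writes it as far as I know.

```python
import struct


def is_bgzf(path):
    """Return True if the file starts with a BGZF block header.

    Simplification (assumption): only the first block is inspected, and the
    'BC' subfield is expected to be the first entry of the gzip extra field.
    """
    with open(path, "rb") as handle:
        header = handle.read(18)
    if len(header) < 18:
        return False
    # Bytes 0-3: gzip magic (1f 8b), compression method 8 (deflate), flags.
    if header[0] != 0x1F or header[1] != 0x8B or header[2] != 8:
        return False
    # Flag bit 2 (FEXTRA) must be set for a BGZF block.
    if not header[3] & 0x04:
        return False
    # Bytes 10-11: total extra-field length; bytes 12-13: subfield ID;
    # bytes 14-15: subfield length (2 for the BGZF block-size field).
    xlen, si1, si2, slen = struct.unpack_from("<HBBH", header, 10)
    return xlen >= 6 and (si1, si2) == (ord("B"), ord("C")) and slen == 2
```

Calling is_bgzf on a contigs file written by plain gzip should return False, while the same file recompressed with bgzip should return True, and only the latter can then be indexed with samtools faidx.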
Do you have examples of what tools require this type of gzipping?
This will require quite a fundamental change to the official nf-core modules, so I want to check if it's worth the effort.
I've been looking into QC tools for finding misassemblies from reads mapped to contigs. Two promising examples are DeepMAsED (Deep learning for Metagenome Assembly Error Detection) and ResMiCo (Residual neural network for Misassembled Contig identification). Both of them use BAM files created by the pipeline, and they require indexing of both the BAM files with mapped reads and the FASTA files with assembled contigs. Indexing the FASTA files fails because they are compressed with plain gzip.
On a side note, these tools are available in Bioconda and Biocontainers, so they'd make good candidates for adding as nf-core modules and being incorporated into nf-core/mag to complement MetaQUAST for assembly QC. Any thoughts on this?
I'm including some relevant links below:
DeepMAsED (Deep learning for Metagenome Assembly Error Detection) https://academic.oup.com/bioinformatics/article/36/10/3011/5756210 https://github.com/leylabmpi/DeepMAsED https://anaconda.org/bioconda/deepmased/files https://quay.io/repository/biocontainers/deepmased?tab=tags
ResMiCo (Residual neural network for Misassembled Contig identification) https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1011001 https://github.com/leylabmpi/ResMiCo https://anaconda.org/bioconda/resmico/files https://quay.io/repository/biocontainers/resmico?tab=tags
Yet another QC tool for misassembly detection is metaMIC (Reference-free Misassembly Identification and Correction of metagenomic assemblies). This one uses a random forest classifier instead of deep learning, and it also starts from BAM + FASTA (while also requiring a samtools pileup file):
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02810-y https://github.com/ZhaoXM-Lab/metaMIC
Unfortunately, this package isn't in Bioconda or Biocontainers.
Short answer:
"finding misassemblies from reads mapped to contigs"
This sounds in scope to me! Maybe propose this on the Slack channel first!
Glad to know there's potential interest in these tools. I will test them some more to see if any of them would be worth integrating into the pipeline.
During testing, I ran into a minor issue which I've described separately: https://github.com/nf-core/mag/issues/750