Pipeline breaks with too large metagenome after assembly (here: 122 million contigs)
Description of the bug
With very large metagenomes (here: 122 million contigs), process NFCORE_MAG:MAG:METABAT2_BINNING:BOWTIE2_ASSEMBLY_BUILD fails with
```
Command error:
  WARNING: Skipping mount /var/singularity/mnt/session/etc/resolv.conf [files]: /etc/resolv.conf doesn't exist in container
  [E::bam_hdr_write] Header too long for BAM format
  [main_samview] failed to write the SAM header
  samtools sort: failed to read header from "-"
```
Steps to reproduce
I guess this can be reproduced by taking any FASTA file with many sequences, building a bowtie2 index, mapping an arbitrary number of sequencing reads to it, and converting the result to BAM format. I tested with different samtools versions: the pipeline-specific container uses samtools 1.11, and I got the same result with 1.14.
Cause of the issue:
The header of BAM files is restricted to a specific size: the header text length is stored as a signed 32-bit integer, so it cannot exceed 2^31 - 1 bytes. The SAM header includes one line per reference sequence with its name and length (lines starting with `@SQ`). The SAM header itself is not size-restricted, but when the file is converted to BAM, the header must fit within that limit, and with 122 million contigs the `@SQ` lines alone exceed it.
This limit is also documented in section 4.2 of the SAM/BAM specification.
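To see why 122 million contigs blow this limit, one can roughly estimate the header size from the contig names and lengths. This is only a back-of-the-envelope sketch (not pipeline code), assuming one `@SQ` line of the form `@SQ\tSN:<name>\tLN:<length>\n` per contig:

```python
# The BAM format stores the header text length as a signed 32-bit integer,
# so a header of 2**31 bytes or more cannot be written.
BAM_HEADER_LIMIT = 2**31 - 1

def estimated_header_bytes(contigs):
    """Rough size of the @SQ section for an iterable of (name, length) pairs,
    assuming one line per contig: "@SQ\\tSN:<name>\\tLN:<length>\\n"."""
    total = 0
    for name, length in contigs:
        # fixed text: "@SQ\tSN:" (7 bytes) + "\tLN:" (4 bytes) + "\n" (1 byte)
        total += 12 + len(name) + len(str(length))
    return total

def header_fits_bam(contigs):
    """True if the estimated header text stays below the BAM limit."""
    return estimated_header_bytes(contigs) < BAM_HEADER_LIMIT
```

With typical MEGAHIT-style contig names of around 15 characters, each `@SQ` line is roughly 30 bytes, so 122 million contigs give on the order of 3.7 GB of header text, well above the ~2.1 GB limit.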
Potential fixes:
1. use `--megahit_options` with a higher contig length threshold, e.g. `--megahit_options "--min-contig-len 1500"` (the default is 200; increase it to a sensible number)
2. decrease the header size by renaming the assembled contigs, e.g. with the hexagesimal system proposed here
3. filter the reference sequences (i.e. reduce the metagenome contigs) before building the bowtie2 index and aligning reads
4. filter the SAM so that its header is small enough before converting to BAM
My fix:
For the metagenome in question I used solution 3: before indexing and alignment, I filtered all contigs shorter than 1500 bp out of the assembly. I did this with the existing `split_fasta.py`, using `split_fasta.py <assembly file> 100000000 1000000000 1500`. I chose the threshold of 1500 bp because this step produces depth information for contig binning with MetaBAT2, which only considers contigs of at least 1500 bp anyway.
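The same length filter can be sketched in plain Python; this is a minimal stand-in for what `split_fasta.py` does here, not the script itself:

```python
def parse_fasta(lines):
    """Minimal FASTA parser: yields (header, sequence) pairs."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def filter_by_length(records, min_len=1500):
    """Keep only records whose sequence is at least min_len bp long."""
    for header, seq in records:
        if len(seq) >= min_len:
            yield header, seq
```

Streaming the assembly through such a filter before `bowtie2-build` keeps the eventual BAM header proportional to the number of retained contigs only.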
However, I do not think this is optimal, because it may produce false-positive alignments: a read that aligned ambiguously to both a >1500 bp and a <1500 bp contig would now be labelled as unambiguously aligned.
I think it would be best to use solution 4, but I am not sure whether this is feasible (I did not attempt it).
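For what it's worth, solution 4 might look something like the following untested sketch, assuming the SAM is streamed line by line: drop `@SQ` header lines for short references, and drop alignment records pointing at the removed references (which, note, raises the same ambiguity concern as solution 3).

```python
def filter_sam_stream(lines, min_len=1500):
    """Drop @SQ header lines (and their alignment records) for references
    shorter than min_len, so the resulting header fits the BAM limit."""
    kept = set()
    for line in lines:
        if line.startswith("@SQ"):
            # @SQ line fields look like "SN:<name>" and "LN:<length>"
            fields = dict(f.split(":", 1) for f in line.rstrip("\n").split("\t")[1:])
            if int(fields["LN"]) >= min_len:
                kept.add(fields["SN"])
                yield line
        elif line.startswith("@"):
            yield line  # other header lines pass through unchanged
        else:
            rname = line.split("\t")[2]  # RNAME is the third SAM column
            if rname == "*" or rname in kept:
                yield line
```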