
Pipeline breaks with too large metagenome after assembly (here: 122 million contigs)

Open d4straub opened this issue 3 years ago • 0 comments

Description of the bug

With very large metagenomes (here: 122 million contigs), process NFCORE_MAG:MAG:METABAT2_BINNING:BOWTIE2_ASSEMBLY_BUILD fails with

Command error:
  WARNING: Skipping mount /var/singularity/mnt/session/etc/resolv.conf [files]: /etc/resolv.conf doesn't exist in container
  [E::bam_hdr_write] Header too long for BAM format
  [main_samview] failed to write the SAM header
  samtools sort: failed to read header from "-"

Steps to reproduce

Presumably this can be reproduced by taking any fasta file with very many sequences, building a bowtie2 index from it, mapping an arbitrary number of sequencing reads against it, and converting the result to BAM format. I tested with different samtools versions: the pipeline-specific container uses samtools 1.11, and 1.14 gives the same result.
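
A minimal sketch of that reproduction outside the pipeline (file names are placeholders, not the pipeline's actual paths):

  # with enough reference sequences, the BAM conversion fails with
  # "[E::bam_hdr_write] Header too long for BAM format"
  bowtie2-build assembly.fa assembly_index
  bowtie2 -x assembly_index -1 reads_R1.fastq.gz -2 reads_R2.fastq.gz \
    | samtools view -bS - \
    | samtools sort -o mapped_sorted.bam -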

Cause of the issue:

The header of a BAM file is restricted to a specific size. The SAM header contains one line per reference sequence with its name and length (lines starting with @SQ). The plain-text SAM header itself is not size-restricted, but when converting to BAM the header text length is stored as a signed 32-bit integer, so it must not exceed 2^31 - 1 bytes. This limit is also documented in section 4.2 of the SAM/BAM specification.
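
To put rough numbers on this (assembly.fa is a placeholder, and ~20 bytes per @SQ line on top of the contig name is an assumed overhead for the "@SQ", "SN:", "LN:" tags, the length digits and separators), the eventual header size can be estimated directly from the assembly:

  # one @SQ line per contig: "@SQ" + tab + "SN:<name>" + tab + "LN:<length>" + newline
  awk '/^>/ { n++; bytes += length($1) - 1 + 20 }
       END  { printf "%d contigs, ~%d header bytes (BAM limit: 2147483647)\n", n, bytes }' assembly.fa

With 122 million contigs this estimate lands well above the 2^31 - 1 byte limit, which matches the error above.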

Potential fixes:

  1. use --megahit_options with a higher minimum contig length, e.g. --megahit_options "--min-contig-len 1500" (the default is 200; increase it to a sensible number)
  2. decrease the header size by renaming the assembled contigs, e.g. with a more compact (hexadecimal) numbering scheme, as proposed here
  3. filter the reference sequences (i.e. reduce the number of metagenome contigs) before building the bowtie2 index and aligning the reads
  4. filter the SAM so that its header is small enough before converting it to BAM (see the sketch after this list)
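
As a rough illustration of what solution 4 could look like (untested; big.sam and the 1500 bp cutoff are placeholders, the usual SN-before-LN order in @SQ lines is assumed, and mate fields pointing at removed contigs would still need attention):

  # keep @SQ lines only for contigs >= 1500 bp, keep all other header lines,
  # and keep alignment records whose reference survived (or that are unmapped, "*")
  awk 'BEGIN { FS = OFS = "\t" }
       /^@SQ/ { if (substr($3, 4) + 0 >= 1500) { keep[substr($2, 4)] = 1; print }; next }
       /^@/   { print; next }
       ($3 in keep) || $3 == "*"' big.sam \
    | samtools sort -o filtered_sorted.bam -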

My fix:

For the metagenome in question I used solution 3: before indexing and alignment I removed all contigs shorter than 1500 bp from the assembly. I did this with the existing split_fasta.py, using split_fasta.py <assembly file> 100000000 1000000000 1500. I chose the 1500 bp threshold because this step only produces depth information for contig binning with MetaBAT2, which only considers contigs above 1500 bp anyway. However, I do not think this is optimal, because it can create false positive alignments (i.e. reads that previously aligned ambiguously to both a >=1500 bp and a <1500 bp contig would now be counted as unambiguously aligned). I think it would be best to use solution 4, but I am not sure whether that is feasible (I did not attempt it).
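
For completeness, the length filter itself as a stand-alone awk sketch (placeholder file names; sequences end up unwrapped on single lines, which bowtie2-build accepts):

  # drop contigs shorter than 1500 bp before building the bowtie2 index
  awk '/^>/ { if (seq != "" && length(seq) >= 1500) print hdr "\n" seq;
              hdr = $0; seq = ""; next }
            { seq = seq $0 }
       END  { if (seq != "" && length(seq) >= 1500) print hdr "\n" seq }' \
    assembly.fa > assembly_min1500.fa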

d4straub · Jan 17 '22 15:01