MAGpurify

tetra-freq unable to handle "N" nucleotide

Open • mmp3 opened this issue 4 years ago • 1 comment

The tetra-freq module crashes if there is an N nucleotide in the nucleotide sequence. An example error message is:

    File "/home/ubuntu/.local/lib/python3.6/site-packages/magpurify/modules/tetra.py", line 87, in main
      contig.kmers[kmer_rev] += 1
    KeyError: 'NTTC'

N nucleotides are very common in MAGs and draft genome assemblies, so this causes errors frequently, such as when working with the UHGG.

Deleting N nucleotides would create artificial adjacencies that bias the tetra-nucleotide frequency profile, and random imputation would introduce a similar bias. Ideally, any 4-mer containing an N would simply be ignored when constructing tetra-nucleotide frequency profiles.
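To illustrate the "ignore any 4-mer with an N" idea, here is a minimal sketch. It is not MAGpurify's actual code: the function names `count_tetramers` and `reverse_complement` are hypothetical, and collapsing each 4-mer with its reverse complement into the lexicographically smaller of the two is an assumption about how the profile is keyed.

```python
from collections import Counter

# Translation table for complementing unambiguous bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")
VALID_BASES = set("ACGT")


def reverse_complement(kmer):
    """Return the reverse complement of an ACGT-only k-mer."""
    return kmer.translate(COMPLEMENT)[::-1]


def count_tetramers(seq, k=4):
    """Count canonical k-mers, skipping every window that contains
    a character other than A, C, G or T (for example N)."""
    counts = Counter()
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        # Skip windows with N or any other ambiguity code instead of
        # deleting the character, so no artificial adjacency is created.
        if not VALID_BASES.issuperset(kmer):
            continue
        # Assumption: a k-mer and its reverse complement share one key.
        canonical = min(kmer, reverse_complement(kmer))
        counts[canonical] += 1
    return counts


if __name__ == "__main__":
    print(count_tetramers("ACGTNACGT"))
```

For the sequence `ACGTNACGT`, only the two `ACGT` windows are counted. Deleting the N first would instead yield `ACGTACGT` and add spurious 4-mers such as `CGTA` that span the gap.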

mmp3, May 04 '20 20:05

I noticed the same. Since scaffolding of metagenome contigs based on paired-end linkage is pretty standard, I would say this is a relatively important bug.

jvollme, Aug 26 '21 21:08