MAGpurify
tetra-freq unable to handle "N" nucleotide
The tetra-freq module crashes if there is an N nucleotide in the nucleotide sequence.
An example error message is:
  File "/home/ubuntu/.local/lib/python3.6/site-packages/magpurify/modules/tetra.py", line 87, in main
    contig.kmers[kmer_rev] += 1
KeyError: 'NTTC'
N nucleotides are very common in MAGs and draft genome assemblies, so this causes errors frequently, such as when working with the UHGG.
Deleting N nucleotides would create artificial adjacencies that bias the tetra-nucleotide frequency profile. Random imputation would introduce a similar bias. Ideally, any 4-mer containing an N would simply be ignored when constructing tetra-nucleotide frequency profiles.
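To illustrate the "ignore any 4-mer with an N" behavior, here is a minimal standalone sketch. It is not MAGpurify's code: the names `tetra_counts`, `canonical`, `COMPLEMENT`, and `VALID` are hypothetical, and collapsing each 4-mer with its reverse complement is an assumption based only on the `kmer_rev` name in the traceback.

```python
from collections import Counter

# Hypothetical helpers for illustration; not part of MAGpurify.
COMPLEMENT = str.maketrans("ACGT", "TGCA")
VALID = set("ACGT")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    rev = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, rev)


def tetra_counts(seq, k=4):
    """Count canonical k-mers, skipping every window that contains a non-ACGT base."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        # Skip windows spanning an N (or any other ambiguity code) instead of
        # deleting the N, so no artificial adjacency is ever counted.
        if not VALID.issuperset(kmer):
            continue
        counts[canonical(kmer)] += 1
    return counts
```

With this approach a sequence such as AAANTTT contributes no 4-mers at all, whereas deleting the N first would yield AAATTT and count 4-mers (AAAT, AATT, ATTT) that never occur in the real sequence.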
I noticed the same. Since scaffolding of metagenome contigs based on paired-end linkage is pretty standard, and scaffolders fill the gaps between linked contigs with runs of N, I would say this is a relatively important bug.