
Suggestion: Option to mask with N (or case sensitive) regions with low coverage

Open brdido opened this issue 4 years ago • 2 comments

Hi,

When using --trusted-contigs, it would be useful to know which regions in the result had gaps closed without coverage, or (similarly) which regions had less than a "threshold coverage", so that they can be marked as N or lower-case. It would be very handy.

Thank you

brdido avatar Apr 15 '20 13:04 brdido

I was just about to submit the same request! I'm not using the --trusted-contigs option, but I have found that spades can produce highly accurate contigs, except in regions with low coverage. Those regions can contain a cluster of fake SNPs. So I would like to mask or annotate them. Since spades already maps reads back to the contigs for error correction, spades could do this on its own.

Currently I'm trying to do this myself by (1) running bwa mem + octopus to create a gvcf and (2) using bcftools to replace letters with N if they do not have sufficient coverage. However, if I could do this in one step with spades, that would be really handy.

The option to use lower-case letters instead of N would be useful. I suppose this is as close as you can get to annotating letters without removing them in a FASTA file.
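For illustration, here is a minimal Python sketch of the two masking styles discussed above (hard masking with N and soft masking with lower-case). It is not part of spades. The function name `mask_low_coverage` is hypothetical, and it assumes the per-base depths have already been obtained from a separate read alignment.

```python
def mask_low_coverage(seq, depths, min_depth, soft=False):
    """Mask bases whose read depth is below min_depth.

    seq       -- contig sequence as a string
    depths    -- per-base read depths, one integer per base
                 (assumed to come from an external alignment step)
    min_depth -- bases with depth < min_depth are masked
    soft      -- if True, lower-case the base instead of replacing it with N
    """
    if len(seq) != len(depths):
        raise ValueError("sequence and depth list must have the same length")

    masked = []
    for base, depth in zip(seq, depths):
        if depth >= min_depth:
            masked.append(base)          # sufficiently covered: keep as-is
        elif soft:
            masked.append(base.lower())  # soft mask: keep the base, lower-case it
        else:
            masked.append("N")           # hard mask: discard the base call
    return "".join(masked)
```

With `soft=True` the original base calls stay recoverable, which is the advantage of lower-case over N mentioned above.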

Thanks for the great software.

bredelings avatar Apr 25 '20 15:04 bredelings

Hello

Thanks for the suggestions.

First of all, SPAdes does not track which regions came from trusted contigs. As in many de Bruijn graph-based assemblers, the correspondence to the input data is lost once the k-mers are extracted and the graph is built.

SPAdes does not track per-base read coverage either. However, the average contig coverage is reported, and this can be used for whatever threshold-based downstream analysis is required. Note that the reported coverage is k-mer coverage, not read coverage.
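As a rough sketch of such a threshold-based analysis, the Python snippet below flags contigs whose reported average coverage falls below a cutoff. It assumes contig headers of the form `NODE_<id>_length_<length>_cov_<coverage>`; the function names and the cutoff are illustrative only, and the value is k-mer coverage averaged over the whole contig, so it cannot locate low-coverage regions inside a contig.

```python
import re

# Assumed header layout: NODE_<id>_length_<length>_cov_<coverage>
HEADER_RE = re.compile(r"NODE_(\d+)_length_(\d+)_cov_(\d+(?:\.\d+)?)")


def kmer_coverage(header):
    """Return the average k-mer coverage parsed from a contig header,
    or None if the header does not match the assumed layout."""
    match = HEADER_RE.search(header)
    if match is None:
        return None
    return float(match.group(3))


def low_coverage_contigs(headers, min_cov):
    """Return the headers whose parsed coverage is below min_cov.
    Headers that cannot be parsed are skipped."""
    flagged = []
    for header in headers:
        cov = kmer_coverage(header)
        if cov is not None and cov < min_cov:
            flagged.append(header)
    return flagged
```

For example, `low_coverage_contigs(headers, 5.0)` would return only the headers whose coverage field is below 5.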

Indeed, there might be mismatches in low-coverage regions, especially if the region is also an instance of an inexact repeat: it is tough to distinguish variants from sequencing artifacts in this case. The mismatch correction step, which involves aligning the reads back to the contigs, is optional (and note that it is also prone to alignment artifacts). Maybe one day we will extend it to mark "untrusted bases"; however, there are no plans for doing this.

asl avatar May 06 '20 10:05 asl