Finding SNPs against assembled reference from same reads?

Open cizydorczyk opened this issue 5 years ago • 3 comments

Hello,

I noticed recently that when I assemble a genome with Shovill and then use Snippy to call SNPs against that genome using the same reads as used to assemble the genome in the first place (i.e. assemble a genome then call SNPs against itself), I get a series of SNPs (>20) identified against the reference.

Why might something like this occur? I would expect 0 SNPs if both processes were "perfect" (although in reality I doubt this is ever the case). Could this be due to some assembly/read correction step?

Note that I ran Shovill with the '--opts "careful"' option set.

Any help in understanding this is much appreciated.

Thank you, Conrad

cizydorczyk, May 08 '19 18:05

Conrad

When you wrote --opts "careful", I assume you meant --opts "--careful"? This enables Spades' internal contig correction step. However, Shovill has its own contig correction step, which was still run (you can use --nocorr to disable that).

There are many reasons you can still get SNPs against the same genome. The main one is that repeats in the original genome are collapsed into single contigs in the assembly, so what were multi-mapping reads now look like uniquely-mapping reads. Reads from slightly different repeat copies all pile up on the one collapsed contig, and the differences between copies get called as SNPs. So these SNPs are false positives. This would not happen if the assembly was "correct" and in one piece, but most Illumina draft genomes will cause problems.
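As a toy illustration of the collapsed-repeat effect, here is a minimal sketch with made-up sequences and idealised, error-free, full-length reads. It is not what Snippy or any real aligner does internally, and the names `repeat_copies`, `collapsed_contig` and `pileup` are hypothetical. It assumes a repeat present in three copies that differ at one base, with the assembly keeping a single collapsed contig.

```python
from collections import Counter

# Toy 3-copy repeat (made-up sequences): copies differ only at index 4.
repeat_copies = ["ACGTACGT", "ACGTTCGT", "ACGTTCGT"]

# Assumption: the assembler collapsed all copies into one contig
# that happens to match copy 0.
collapsed_contig = repeat_copies[0]

# Idealised reads: 10 error-free, full-length reads from each copy.
# With only one contig to map to, every read lands on it.
reads = [copy for copy in repeat_copies for _ in range(10)]


def pileup(reads, position):
    """Count the bases that the reads place at one contig position."""
    return Counter(read[position] for read in reads)


counts = pileup(reads, 4)
ref_base = collapsed_contig[4]
# Fraction of reads that disagree with the contig at this position.
alt_fraction = 1 - counts[ref_base] / sum(counts.values())

print(ref_base, counts["A"], counts["T"], f"{alt_fraction:.2f}")
# prints: A 10 20 0.67
```

In this toy case two thirds of the reads disagree with the contig at that position, so a variant caller would probably see good evidence for a SNP even though the reads and the assembly come from the same genome.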

Can you confirm what versions of the tools you are using?

$ snippy --version
snippy 4.3.8

$ shovill --version
shovill 1.0.4

tseemann, May 08 '19 22:05

My mistake - yes, I meant --opts "careful". I left Shovill contig correction enabled.

(snippy-env) -bash-4.2$ snippy --version
snippy 4.3.6
(shovill-env) -bash-4.2$ shovill --version
shovill 1.0.4

Your explanation of collapsed repeats makes sense. I would also venture a guess that it might be due to the fact that I did not use error-corrected reads for SNP calling, but left read error correction enabled when I ran Shovill. I have seen mixed opinions on Spades' error correction, but I have always run it in careful mode.

Using a draft genome as a reference seems a bit more challenging than I initially thought. Perhaps I will try to find a complete reference or consider closing a genome rather than calling SNPs against a draft. Thank you for your help!

cizydorczyk, May 09 '19 16:05

FYI: --opts "careful" is not correct, as it will pass careful to Spades, rather than --careful.

It would be instructive to use the --report option of Snippy to look at the actual SNP pileups in snps.report.txt to see what is going on with those 20 false positives, and to see if they are in repeat contigs.

Repeat contigs can be identified reasonably well by using the coverage. First, get the biggest contig, and use its coverage as the "1x" level. Then any contig whose coverage is a multiple of that (2x, 3x, and so on) can be considered multicopy. These could be artificially duplicated in, or removed from, the contigs.fa; and then that could be used as the reference.
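The coverage heuristic above could be sketched roughly as follows. This is only a sketch: it assumes the FASTA headers in contigs.fa carry `len=` and `cov=` fields (for example `>contig00001 len=263154 cov=8.9 ...`), so adjust the regular expressions if your headers look different. The function names and the `min_ratio=1.5` cutoff are hypothetical choices for illustration, not part of Shovill or Snippy.

```python
import re


def parse_header(line):
    """Return (name, length, coverage) from a FASTA header, or None.

    Assumes the header carries len=<int> and cov=<float> fields.
    """
    if not line.startswith(">"):
        return None
    length = re.search(r"\blen=(\d+)", line)
    cov = re.search(r"\bcov=(\d+(?:\.\d+)?)", line)
    if not (length and cov):
        return None
    name = line[1:].split()[0]
    return name, int(length.group(1)), float(cov.group(1))


def contig_coverages(fasta_path):
    """Collect (name, length, coverage) for every parsable header in a FASTA file."""
    with open(fasta_path) as handle:
        parsed = (parse_header(line) for line in handle)
        return [record for record in parsed if record is not None]


def flag_multicopy(contigs, min_ratio=1.5):
    """Flag contigs whose coverage is >= min_ratio times the longest contig's.

    The longest contig's coverage is taken as the "1x" level. Returns
    (name, length, estimated_copies) for each flagged contig.
    """
    if not contigs:
        return []
    baseline = max(contigs, key=lambda c: c[1])[2]
    if baseline <= 0:
        return []
    flagged = []
    for name, length, cov in contigs:
        ratio = cov / baseline
        if ratio >= min_ratio:
            flagged.append((name, length, round(ratio)))
    return flagged
```

Calling `flag_multicopy(contig_coverages("contigs.fa"))` would then list the candidate multicopy contigs with a rough copy-number estimate. Short contigs tend to have noisy coverage, so the list is a starting point for inspection rather than a definitive answer.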

I don't use the corrected reads either in my correction step; I use the original raw reads.

tseemann, May 10 '19 03:05