snippy icon indicating copy to clipboard operation
snippy copied to clipboard

REF allele even though no reads have REF

Open jackkamm opened this issue 5 years ago • 7 comments

Checking some samples in IGV, I noticed several genotypes where snippy-core called the REF allele, even though none of the aligned reads were REF.

Inspecting more closely, this happens because freebayes does not output the SNP (false negative), but the site has 10 or more reads, so does not get explicitly masked either.

See the script below to reproduce. In summary, run snippy/snippy-core on samples SRR3219271 and SRR3219831 from SRA, aligning against NC_016845.1 from NCBI nuccore. Then, check position 4,308,984. All 12 reads of SRR3219271 are for the alternate allele G, yet its allele in core.vcf is the reference C.

image

In addition, here is a VCF of the differences from snippy-core, annotated with allele depths from bcftools mpileup. At most of these sites, one of the samples was called REF but had (almost) no REF reads. Note these are very closely related samples -- a joint mpileup shows about 4 SNPs between them, however snippy-core gives a SNP distance of 30.

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##commandLine="snippy-core --ref NC_016845.1.fasta SRR3219271 SRR3219831"
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##INFO=<ID=TYPE,Number=A,Type=String,Description="Allele type: snp ins del">
##contig=<ID=NC_016845.1,len=5333942>
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##bcftools_viewVersion=1.9+htslib-1.9
##bcftools_viewCommand=view --min-ac 1:minor -Oz -o diffs.vcf.gz core.vcf; Date=Tue Feb 12 01:27:56 2019
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths">
##bcftools_annotateVersion=1.9+htslib-1.9
##bcftools_annotateCommand=annotate -a repileup.vcf.gz -c FMT/AD diffs.vcf.gz; Date=Tue Feb 12 01:34:06 2019
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	SRR3219271	SRR3219831
NC_016845.1	1767726	.	A	G	.	PASS	TYPE=snp;AC=1;AN=2	GT:AD	0:73,0	1:1,85
NC_016845.1	1767727	.	G	A	.	PASS	TYPE=snp;AC=1;AN=2	GT:AD	0:73,0	1:1,85
NC_016845.1	2519417	.	C	T	.	PASS	TYPE=snp;AC=1;AN=2	GT:AD	1:0,99	0:94,0
NC_016845.1	2999873	.	C	T	.	PASS	TYPE=snp;AC=1;AN=2	GT:AD	1:0,9	0:0,8
NC_016845.1	3434675	.	C	G	.	PASS	TYPE=snp;AC=1;AN=2	GT:AD	0:0,10	1:0,12
NC_016845.1	3434676	.	T	C	.	PASS	TYPE=snp;AC=1;AN=2	GT:AD	0:0,10	1:0,12
NC_016845.1	3728326	.	A	C	.	PASS	TYPE=snp;AC=1;AN=2	GT:AD	1:0,13	0:0,9
NC_016845.1	3744194	.	C	A	.	PASS	TYPE=snp;AC=1;AN=2	GT:AD	1:0,12	0:0,9
NC_016845.1	3757675	.	G	A	.	PASS	TYPE=snp;AC=1;AN=2	GT:AD	0:1,7	1:0,7
NC_016845.1	3757810	.	G	T	.	PASS	TYPE=snp;AC=1;AN=2	GT:AD	0:0,8	1:0,7
NC_016845.1	3807064	.	G	T	.	PASS	TYPE=snp;AC=1;AN=2	GT:AD	1:63,3	0:57,1
NC_016845.1	3819590	.	G	C	.	PASS	TYPE=snp;AC=1;AN=2	GT:AD	0:0,8	1:0,19
NC_016845.1	3834511	.	G	C	.	PASS	TYPE=snp;AC=1;AN=2	GT:AD	0:0,6	1:0,6
NC_016845.1	3936967	.	G	A	.	PASS	TYPE=snp;AC=1;AN=2	GT:AD	0:0,7	1:0,8
NC_016845.1	4145095	.	G	C	.	PASS	TYPE=snp;AC=1;AN=2	GT:AD	1:0,14	0:0,10
NC_016845.1	4149311	.	G	A	.	PASS	TYPE=snp;AC=1;AN=2	GT:AD	1:0,7	0:0,6
NC_016845.1	4171825	.	A	G	.	PASS	TYPE=snp;AC=1;AN=2	GT:AD	0:0,6	1:0,8
NC_016845.1	4201982	.	C	T	.	PASS	TYPE=snp;AC=1;AN=2	GT:AD	0:0,8	1:0,9
NC_016845.1	4280327	.	C	G	.	PASS	TYPE=snp;AC=1;AN=2	GT:AD	0:0,9	1:0,8
NC_016845.1	4284222	.	G	A	.	PASS	TYPE=snp;AC=1;AN=2	GT:AD	1:0,7	0:0,7
NC_016845.1	4288672	.	C	G	.	PASS	TYPE=snp;AC=1;AN=2	GT:AD	0:0,6	1:0,10
NC_016845.1	4297333	.	A	G	.	PASS	TYPE=snp;AC=1;AN=2	GT:AD	1:0,18	0:1,13
NC_016845.1	4297498	.	G	C	.	PASS	TYPE=snp;AC=1;AN=2	GT:AD	0:0,16	1:0,14
NC_016845.1	4308984	.	C	G	.	PASS	TYPE=snp;AC=1;AN=2	GT:AD	0:0,10	1:0,10
NC_016845.1	4309935	.	G	C	.	PASS	TYPE=snp;AC=1;AN=2	GT:AD	0:0,9	1:0,15
NC_016845.1	4334377	.	C	T	.	PASS	TYPE=snp;AC=1;AN=2	GT:AD	1:0,15	0:0,8
NC_016845.1	4369099	.	A	G	.	PASS	TYPE=snp;AC=1;AN=2	GT:AD	1:0,93	0:98,0
NC_016845.1	4400154	.	G	A	.	PASS	TYPE=snp;AC=1;AN=2	GT:AD	1:0,6	0:0,7
NC_016845.1	4405624	.	G	A	.	PASS	TYPE=snp;AC=1;AN=2	GT:AD	0:1,7	1:0,11
NC_016845.1 4568811 . T C . PASS TYPE=snp;AC=1;AN=2 GT:AD 0:0,10 1:0,12

I created this vcf with the following script:

#!/bin/bash

set -ex

REF=NC_016845.1
SAMPLE1=SRR3219271
SAMPLE2=SRR3219831

efetch -db nuccore -format fasta -id $REF > $REF.fasta

function run_snippy {
    local sample=$1
    fastq-dump --origfmt -I --split-files --gzip $sample
    trim_galore --paired --gzip "$sample"_*.fastq.gz
    snippy --outdir $sample --ref $REF.fasta --R1 "$sample"_1_val_1.fq.gz --R2 "$sample"_2_val_2.fq.gz
}

run_snippy "$SAMPLE1"
run_snippy "$SAMPLE2"

snippy-core --ref "$REF.fasta" "$SAMPLE1" "$SAMPLE2"

# get allele counts at sites where $SAMPLE1 != $SAMPLE2
bcftools view --min-ac 1:minor core.vcf -Oz -o diffs.vcf.gz
bcftools query -f '%CHROM\t%POS\n' diffs.vcf.gz > diffs.positions
bcftools mpileup --fasta-ref $REF.fasta -R diffs.positions $SAMPLE1/snps.bam $SAMPLE2/snps.bam -a FORMAT/AD -Oz -o repileup.vcf.gz

# merge snippy genotypes with mpileup allele depths
tabix diffs.vcf.gz
tabix repileup.vcf.gz
bcftools annotate -a repileup.vcf.gz -c FMT/AD diffs.vcf.gz > diffs_with_ad.vcf

I'm using snippy=4.3.6 and freebayes=1.2.0.

jackkamm avatar Feb 12 '19 03:02 jackkamm

@jackkamm are all the reads you are examining, do they have mapping quality >= 60 ? also, i realise freebayes uses orphan reads (no proper pair) which can cause problems. I am investigating your example. thank you!

tseemann avatar Feb 12 '19 05:02 tseemann

No, I'm not filtering by mapq here, but it appears nearly all the reads have mapq 60. At position 4,308,984 I counted 10 reads with mapq 60, and 2 reads with mapq 59. Checking a couple other positions in IGV -- all 10 reads at the false REF in 2999873 have mapq 60, while the 11 reads for the false REF at 4400154 also all have mapq 60.

Also -- I noticed the VCF I pasted above has slightly fewer reads than I see in IGV. I think that's because of the probabilistic realignment. Adding -B -q 60 to the bcftools mpileup line in the script will turn off the realignment and restrict to mapq>=60.

jackkamm avatar Feb 12 '19 13:02 jackkamm

Freebayes is run to use only MAPQ=60 and BASEQ=13 data to infer SNPs from.

However it is using orphan reads from split pairs, which could be the problem,. I would need to filter the BAM first as freebayes does not have a -f/-F option.

tseemann avatar Feb 13 '19 05:02 tseemann

Just found this too:

"samtools 1.8 discards overlaps by default. Meaning that from two paired reads that cover a position, it will select only one, by increasing the base quality of the selected read and setting the other quality value to 0. Disable with -x "

tseemann avatar Dec 02 '19 05:12 tseemann

I have exactly the same problem. @tseemann @jackkamm

The snippy call on a strain (let's call it "strain_1") dont found any SNP at a given a position (vcf, tab, all files). But at the given position, the SNP exist (checked with IGV view (BWA mem) ) in the "strain_1" BAM file.

But, in the snippy-core pipeline, the SNP appear for "strain_1" with the REF allele base at the given position....

Snippy version 4.2

I upgrade with conda to 4.6 to see if the problem is still present.

Nilad avatar Aug 06 '20 16:08 Nilad

The problem is still present.

For "strain_1": The Snippy BAM file show 100% SNP ALT frac with 11 coverage. The BWA mem file show 100%SNP ALT frac with 14 coverage. The Snippy VCF file dont report this SNP. The Snippy-core VCF file report this SNP at the REF allele.

Tested with Sinpy 4.6.0. with default options.

snippy check

[13:21:48] This is snippy 4.6.0
[13:21:48] Written by Torsten Seemann
[13:21:48] Obtained from https://github.com/tseemann/snippy
[13:21:48] Detected operating system: linux
[13:21:48] Enabling bundled linux tools.
[13:21:48] Found bwa - /usr/local/snippy-4.6.0/binaries/linux/bwa
[13:21:48] Found bcftools - /usr/local/snippy-4.6.0/binaries/linux/bcftools
[13:21:48] Found samtools - /usr/local/snippy-4.6.0/binaries/linux/samtools
[13:21:48] Found java - /usr/bin/java
[13:21:48] Found snpEff - /usr/local/snippy-4.6.0/binaries/noarch/snpEff
[13:21:48] Found samclip - /usr/local/snippy-4.6.0/binaries/noarch/samclip
[13:21:48] Found seqtk - /usr/local/snippy-4.6.0/binaries/linux/seqtk
[13:21:48] Found parallel - /usr/local/snippy-4.6.0/binaries/noarch/parallel
[13:21:48] Found freebayes - /usr/local/snippy-4.6.0/binaries/linux/freebayes
[13:21:48] Found freebayes-parallel - /usr/local/snippy-4.6.0/binaries/noarch/freebayes-parallel
[13:21:48] Found fasta_generate_regions.py - /usr/local/snippy-4.6.0/binaries/noarch/fasta_generate_regions.py
[13:21:48] Found vcfstreamsort - /usr/local/snippy-4.6.0/binaries/linux/vcfstreamsort
[13:21:48] Found vcfuniq - /usr/local/snippy-4.6.0/binaries/linux/vcfuniq
[13:21:48] Found vcffirstheader - /usr/local/snippy-4.6.0/binaries/noarch/vcffirstheader
[13:21:48] Found gzip - /bin/gzip
[13:21:48] Found vt - /usr/local/snippy-4.6.0/binaries/linux/vt
[13:21:48] Found snippy-vcf_to_tab - /usr/local/snippy-4.6.0/bin/snippy-vcf_to_tab
[13:21:48] Found snippy-vcf_report - /usr/local/snippy-4.6.0/bin/snippy-vcf_report
[13:21:48] Checking version: samtools --version is >= 1.7 - ok, have 1.10
[13:21:48] Checking version: bcftools --version is >= 1.7 - ok, have 1.10
[13:21:48] Checking version: freebayes --version is >= 1.1 - ok, have 1.3.1
[13:21:49] Checking version: snpEff -version is >= 4.3 - ok, have 4.3
[13:21:49] Checking version: bwa is >= 0.7.12 - ok, have 0.7.17
[13:21:49] Dependences look good!

Nilad avatar Aug 07 '20 11:08 Nilad

I try with --mincov 30 and the problem persist.

I think it's a problem is in the snippy call (SNP dont reported) and the missed SNP become REF allele in snippy-core.

Snippy snps.vcf zoom on a problem SNP : "strain_1"

b0002 | 1329 | . | A | C | 1147.13 | . | AB=0;AO=46;DP=46;QA=1322;QR=0;RO=0;TYPE=snp | GT:DP:RO:QR:AO:QA:GL | 1/1:46:0:0:46:1322:-119.229,-13.8474,0
b0002 | 1471 | . | C | T | 869.769 | . | AB=0;AO=33;DP=33;QA=998;QR=0;RO=0;TYPE=snp | GT:DP:RO:QR:AO:QA:GL | 1/1:33:0:0:33:998:-90.0916,-9.93399,0
b0002 | 1875 | . | A | T | 821.453 | . | AB=0;AO=31;DP=31;QA=943;QR=0;RO=0;TYPE=snp | GT:DP:RO:QR:AO:QA:GL | 1/1:31:0:0:31:943:-85.1444,-9.33193,0

same region for "strain_2"


b0002 | 1329 | . | A | C | 1854.79 | . | AB=0;AO=67;DP=68;QA=2099;QR=0;RO=0;TYPE=snp | GT:DP:RO:QR:AO:QA:GL | 1/1:68:0:0:67:2099:-189.013,-20.169,0 
b0002 | 1471 | . | C | T | 1631.96 | . | AB=0;AO=61;DP=66;QA=1860;QR=0;RO=0;TYPE=snp | GT:DP:RO:QR:AO:QA:GL | 1/1:66:0:0:61:1860:-167.487,-18.3628,0 
b0002 | 1512 | . | C | T | 1837.15 | . | AB=0;AO=66;DP=66;QA=2090;QR=0;RO=0;TYPE=snp | GT:DP:RO:QR:AO:QA:GL | 1/1:66:0:0:66:2090:-188.348,-19.868,0 
b0002 | 1524 | . | T | A | 1498.15 | . | AB=0;AO=55;DP=60;QA=1732;QR=56;RO=4;TYPE=snp | GT:DP:RO:QR:AO:QA:GL | 1/1:60:4:56:55:1732:-150.952,-12.7208,0 
b0002 | 1662 | . | C | T | 1189.95 | . | AB=0;AO=43;DP=43;QA=1353;QR=0;RO=0;TYPE=snp | GT:DP:RO:QR:AO:QA:GL | 1/1:43:0:0:43:1353:-122.039,-12.9443,0 
b0002 | 1725 | . | C | T | 1406.25 | . | AB=0;AO=51;DP=52;QA=1595;QR=0;RO=0;TYPE=snp | GT:DP:RO:QR:AO:QA:GL | 1/1:52:0:0:51:1595:-143.634,-15.3525,0 
b0002 | 1875 | . | A | T | 794.345 | . | AB=0;AO=32;DP=34;QA=913;QR=0;RO=0;TYPE=snp | GT:DP:RO:QR:AO:QA:GL | 1/1:34:0:0:32:913:-82.2611,-9.63296,0 


IGV view

image

The first track is the BAM file of "strain_1" (BWA), the second track is the BAM file of "strain_2" (BWA) and the third track is the snippy BAM file of "strain_1".

The SNP at 1471 pos is correctly view in the 2 strains but the SNP at 1512 pos is not view in "strain_1". It's like a whole region of the b0002 (1480-1800) is not taken into account for the "strain_1".

Depth

STRAIN | POS | DEPTH 
strain_1 | 1471 | 42 
strain_2 | 1471 | 78 
strain_1_snippy_bam | 1471 | 33 

STRAIN | POS | DEPTH 
strain_1 | 1512 | 37
strain_2 | 1512 | 78 
strain_1_snippy_bam | 1512 | 31 

Nilad avatar Aug 07 '20 13:08 Nilad