snippy
snippy copied to clipboard
REF allele even though no reads have REF
Checking some samples in IGV, I noticed several genotypes where snippy-core called the REF allele, even though none of the aligned reads were REF.
Inspecting more closely, this happens because freebayes does not output the SNP (false negative), but the site has 10 or more reads, so does not get explicitly masked either.
See the script below to reproduce. In summary, run snippy/snippy-core on samples SRR3219271 and SRR3219831 from SRA, aligning against NC_016845.1 from NCBI nuccore. Then, check position 4,308,984. All 12 reads of SRR3219271 are for the alternate allele G, yet its allele in core.vcf is the reference C.
In addition, here is a VCF of the differences from snippy-core, annotated with allele depths from bcftools mpileup. At most of these sites, one of the samples was called REF but had (almost) no REF reads. Note these are very closely related samples -- a joint mpileup shows about 4 SNPs between them, however snippy-core gives a SNP distance of 30.
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##commandLine="snippy-core --ref NC_016845.1.fasta SRR3219271 SRR3219831"
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##INFO=<ID=TYPE,Number=A,Type=String,Description="Allele type: snp ins del">
##contig=<ID=NC_016845.1,len=5333942>
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##bcftools_viewVersion=1.9+htslib-1.9
##bcftools_viewCommand=view --min-ac 1:minor -Oz -o diffs.vcf.gz core.vcf; Date=Tue Feb 12 01:27:56 2019
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths">
##bcftools_annotateVersion=1.9+htslib-1.9
##bcftools_annotateCommand=annotate -a repileup.vcf.gz -c FMT/AD diffs.vcf.gz; Date=Tue Feb 12 01:34:06 2019
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SRR3219271 SRR3219831
NC_016845.1 1767726 . A G . PASS TYPE=snp;AC=1;AN=2 GT:AD 0:73,0 1:1,85
NC_016845.1 1767727 . G A . PASS TYPE=snp;AC=1;AN=2 GT:AD 0:73,0 1:1,85
NC_016845.1 2519417 . C T . PASS TYPE=snp;AC=1;AN=2 GT:AD 1:0,99 0:94,0
NC_016845.1 2999873 . C T . PASS TYPE=snp;AC=1;AN=2 GT:AD 1:0,9 0:0,8
NC_016845.1 3434675 . C G . PASS TYPE=snp;AC=1;AN=2 GT:AD 0:0,10 1:0,12
NC_016845.1 3434676 . T C . PASS TYPE=snp;AC=1;AN=2 GT:AD 0:0,10 1:0,12
NC_016845.1 3728326 . A C . PASS TYPE=snp;AC=1;AN=2 GT:AD 1:0,13 0:0,9
NC_016845.1 3744194 . C A . PASS TYPE=snp;AC=1;AN=2 GT:AD 1:0,12 0:0,9
NC_016845.1 3757675 . G A . PASS TYPE=snp;AC=1;AN=2 GT:AD 0:1,7 1:0,7
NC_016845.1 3757810 . G T . PASS TYPE=snp;AC=1;AN=2 GT:AD 0:0,8 1:0,7
NC_016845.1 3807064 . G T . PASS TYPE=snp;AC=1;AN=2 GT:AD 1:63,3 0:57,1
NC_016845.1 3819590 . G C . PASS TYPE=snp;AC=1;AN=2 GT:AD 0:0,8 1:0,19
NC_016845.1 3834511 . G C . PASS TYPE=snp;AC=1;AN=2 GT:AD 0:0,6 1:0,6
NC_016845.1 3936967 . G A . PASS TYPE=snp;AC=1;AN=2 GT:AD 0:0,7 1:0,8
NC_016845.1 4145095 . G C . PASS TYPE=snp;AC=1;AN=2 GT:AD 1:0,14 0:0,10
NC_016845.1 4149311 . G A . PASS TYPE=snp;AC=1;AN=2 GT:AD 1:0,7 0:0,6
NC_016845.1 4171825 . A G . PASS TYPE=snp;AC=1;AN=2 GT:AD 0:0,6 1:0,8
NC_016845.1 4201982 . C T . PASS TYPE=snp;AC=1;AN=2 GT:AD 0:0,8 1:0,9
NC_016845.1 4280327 . C G . PASS TYPE=snp;AC=1;AN=2 GT:AD 0:0,9 1:0,8
NC_016845.1 4284222 . G A . PASS TYPE=snp;AC=1;AN=2 GT:AD 1:0,7 0:0,7
NC_016845.1 4288672 . C G . PASS TYPE=snp;AC=1;AN=2 GT:AD 0:0,6 1:0,10
NC_016845.1 4297333 . A G . PASS TYPE=snp;AC=1;AN=2 GT:AD 1:0,18 0:1,13
NC_016845.1 4297498 . G C . PASS TYPE=snp;AC=1;AN=2 GT:AD 0:0,16 1:0,14
NC_016845.1 4308984 . C G . PASS TYPE=snp;AC=1;AN=2 GT:AD 0:0,10 1:0,10
NC_016845.1 4309935 . G C . PASS TYPE=snp;AC=1;AN=2 GT:AD 0:0,9 1:0,15
NC_016845.1 4334377 . C T . PASS TYPE=snp;AC=1;AN=2 GT:AD 1:0,15 0:0,8
NC_016845.1 4369099 . A G . PASS TYPE=snp;AC=1;AN=2 GT:AD 1:0,93 0:98,0
NC_016845.1 4400154 . G A . PASS TYPE=snp;AC=1;AN=2 GT:AD 1:0,6 0:0,7
NC_016845.1 4405624 . G A . PASS TYPE=snp;AC=1;AN=2 GT:AD 0:1,7 1:0,11
NC_016845.1 4568811 . T C . PASS TYPE=snp;AC=1;AN=2 GT:AD 0:0,10 1:0,12
I created this vcf with the following script:
#!/bin/bash
set -ex
REF=NC_016845.1
SAMPLE1=SRR3219271
SAMPLE2=SRR3219831
efetch -db nuccore -format fasta -id $REF > $REF.fasta
function run_snippy {
local sample=$1
fastq-dump --origfmt -I --split-files --gzip $sample
trim_galore --paired --gzip "$sample"_*.fastq.gz
snippy --outdir $sample --ref $REF.fasta --R1 "$sample"_1_val_1.fq.gz --R2 "$sample"_2_val_2.fq.gz
}
run_snippy "$SAMPLE1"
run_snippy "$SAMPLE2"
snippy-core --ref "$REF.fasta" "$SAMPLE1" "$SAMPLE2"
# get allele counts at sites where $SAMPLE1 != $SAMPLE2
bcftools view --min-ac 1:minor core.vcf -Oz -o diffs.vcf.gz
bcftools query -f '%CHROM\t%POS\n' diffs.vcf.gz > diffs.positions
bcftools mpileup --fasta-ref $REF.fasta -R diffs.positions $SAMPLE1/snps.bam $SAMPLE2/snps.bam -a FORMAT/AD -Oz -o repileup.vcf.gz
# merge snippy genotypes with mpileup allele depths
tabix diffs.vcf.gz
tabix repileup.vcf.gz
bcftools annotate -a repileup.vcf.gz -c FMT/AD diffs.vcf.gz > diffs_with_ad.vcf
I'm using snippy=4.3.6 and freebayes=1.2.0.
@jackkamm are all the reads you are examining, do they have mapping quality >= 60 ? also, i realise freebayes uses orphan reads (no proper pair) which can cause problems. I am investigating your example. thank you!
No, I'm not filtering by mapq here, but it appears nearly all the reads have mapq 60. At position 4,308,984 I counted 10 reads with mapq 60, and 2 reads with mapq 59. Checking a couple other positions in IGV -- all 10 reads at the false REF in 2999873 have mapq 60, while the 11 reads for the false REF at 4400154 also all have mapq 60.
Also -- I noticed the VCF I pasted above has slightly fewer reads than I see in IGV. I think that's because of the probabilistic realignment. Adding -B -q 60
to the bcftools mpileup
line in the script will turn off the realignment and restrict to mapq>=60.
Freebayes is run to use only MAPQ=60 and BASEQ=13 data to infer SNPs from.
However it is using orphan reads from split pairs, which could be the problem,. I would need to filter the BAM first as freebayes does not have a -f/-F
option.
Just found this too:
"samtools 1.8 discards overlaps by default. Meaning that from two paired reads that cover a position, it will select only one, by increasing the base quality of the selected read and setting the other quality value to 0. Disable with -x "
I have exactly the same problem. @tseemann @jackkamm
The snippy call on a strain (let's call it "strain_1") dont found any SNP at a given a position (vcf, tab, all files). But at the given position, the SNP exist (checked with IGV view (BWA mem) ) in the "strain_1" BAM file.
But, in the snippy-core pipeline, the SNP appear for "strain_1" with the REF allele base at the given position....
Snippy version 4.2
I upgrade with conda to 4.6 to see if the problem is still present.
The problem is still present.
For "strain_1": The Snippy BAM file show 100% SNP ALT frac with 11 coverage. The BWA mem file show 100%SNP ALT frac with 14 coverage. The Snippy VCF file dont report this SNP. The Snippy-core VCF file report this SNP at the REF allele.
Tested with Sinpy 4.6.0. with default options.
snippy check
[13:21:48] This is snippy 4.6.0
[13:21:48] Written by Torsten Seemann
[13:21:48] Obtained from https://github.com/tseemann/snippy
[13:21:48] Detected operating system: linux
[13:21:48] Enabling bundled linux tools.
[13:21:48] Found bwa - /usr/local/snippy-4.6.0/binaries/linux/bwa
[13:21:48] Found bcftools - /usr/local/snippy-4.6.0/binaries/linux/bcftools
[13:21:48] Found samtools - /usr/local/snippy-4.6.0/binaries/linux/samtools
[13:21:48] Found java - /usr/bin/java
[13:21:48] Found snpEff - /usr/local/snippy-4.6.0/binaries/noarch/snpEff
[13:21:48] Found samclip - /usr/local/snippy-4.6.0/binaries/noarch/samclip
[13:21:48] Found seqtk - /usr/local/snippy-4.6.0/binaries/linux/seqtk
[13:21:48] Found parallel - /usr/local/snippy-4.6.0/binaries/noarch/parallel
[13:21:48] Found freebayes - /usr/local/snippy-4.6.0/binaries/linux/freebayes
[13:21:48] Found freebayes-parallel - /usr/local/snippy-4.6.0/binaries/noarch/freebayes-parallel
[13:21:48] Found fasta_generate_regions.py - /usr/local/snippy-4.6.0/binaries/noarch/fasta_generate_regions.py
[13:21:48] Found vcfstreamsort - /usr/local/snippy-4.6.0/binaries/linux/vcfstreamsort
[13:21:48] Found vcfuniq - /usr/local/snippy-4.6.0/binaries/linux/vcfuniq
[13:21:48] Found vcffirstheader - /usr/local/snippy-4.6.0/binaries/noarch/vcffirstheader
[13:21:48] Found gzip - /bin/gzip
[13:21:48] Found vt - /usr/local/snippy-4.6.0/binaries/linux/vt
[13:21:48] Found snippy-vcf_to_tab - /usr/local/snippy-4.6.0/bin/snippy-vcf_to_tab
[13:21:48] Found snippy-vcf_report - /usr/local/snippy-4.6.0/bin/snippy-vcf_report
[13:21:48] Checking version: samtools --version is >= 1.7 - ok, have 1.10
[13:21:48] Checking version: bcftools --version is >= 1.7 - ok, have 1.10
[13:21:48] Checking version: freebayes --version is >= 1.1 - ok, have 1.3.1
[13:21:49] Checking version: snpEff -version is >= 4.3 - ok, have 4.3
[13:21:49] Checking version: bwa is >= 0.7.12 - ok, have 0.7.17
[13:21:49] Dependences look good!
I try with --mincov 30
and the problem persist.
I think it's a problem is in the snippy call (SNP dont reported) and the missed SNP become REF allele in snippy-core.
Snippy snps.vcf zoom on a problem SNP : "strain_1"
b0002 | 1329 | . | A | C | 1147.13 | . | AB=0;AO=46;DP=46;QA=1322;QR=0;RO=0;TYPE=snp | GT:DP:RO:QR:AO:QA:GL | 1/1:46:0:0:46:1322:-119.229,-13.8474,0
b0002 | 1471 | . | C | T | 869.769 | . | AB=0;AO=33;DP=33;QA=998;QR=0;RO=0;TYPE=snp | GT:DP:RO:QR:AO:QA:GL | 1/1:33:0:0:33:998:-90.0916,-9.93399,0
b0002 | 1875 | . | A | T | 821.453 | . | AB=0;AO=31;DP=31;QA=943;QR=0;RO=0;TYPE=snp | GT:DP:RO:QR:AO:QA:GL | 1/1:31:0:0:31:943:-85.1444,-9.33193,0
same region for "strain_2"
b0002 | 1329 | . | A | C | 1854.79 | . | AB=0;AO=67;DP=68;QA=2099;QR=0;RO=0;TYPE=snp | GT:DP:RO:QR:AO:QA:GL | 1/1:68:0:0:67:2099:-189.013,-20.169,0
b0002 | 1471 | . | C | T | 1631.96 | . | AB=0;AO=61;DP=66;QA=1860;QR=0;RO=0;TYPE=snp | GT:DP:RO:QR:AO:QA:GL | 1/1:66:0:0:61:1860:-167.487,-18.3628,0
b0002 | 1512 | . | C | T | 1837.15 | . | AB=0;AO=66;DP=66;QA=2090;QR=0;RO=0;TYPE=snp | GT:DP:RO:QR:AO:QA:GL | 1/1:66:0:0:66:2090:-188.348,-19.868,0
b0002 | 1524 | . | T | A | 1498.15 | . | AB=0;AO=55;DP=60;QA=1732;QR=56;RO=4;TYPE=snp | GT:DP:RO:QR:AO:QA:GL | 1/1:60:4:56:55:1732:-150.952,-12.7208,0
b0002 | 1662 | . | C | T | 1189.95 | . | AB=0;AO=43;DP=43;QA=1353;QR=0;RO=0;TYPE=snp | GT:DP:RO:QR:AO:QA:GL | 1/1:43:0:0:43:1353:-122.039,-12.9443,0
b0002 | 1725 | . | C | T | 1406.25 | . | AB=0;AO=51;DP=52;QA=1595;QR=0;RO=0;TYPE=snp | GT:DP:RO:QR:AO:QA:GL | 1/1:52:0:0:51:1595:-143.634,-15.3525,0
b0002 | 1875 | . | A | T | 794.345 | . | AB=0;AO=32;DP=34;QA=913;QR=0;RO=0;TYPE=snp | GT:DP:RO:QR:AO:QA:GL | 1/1:34:0:0:32:913:-82.2611,-9.63296,0
IGV view
The first track is the BAM file of "strain_1" (BWA), the second track is the BAM file of "strain_2" (BWA) and the third track is the snippy BAM file of "strain_1".
The SNP at 1471 pos is correctly view in the 2 strains but the SNP at 1512 pos is not view in "strain_1". It's like a whole region of the b0002 (1480-1800) is not taken into account for the "strain_1".
Depth
STRAIN | POS | DEPTH
strain_1 | 1471 | 42
strain_2 | 1471 | 78
strain_1_snippy_bam | 1471 | 33
STRAIN | POS | DEPTH
strain_1 | 1512 | 37
strain_2 | 1512 | 78
strain_1_snippy_bam | 1512 | 31