snippy icon indicating copy to clipboard operation
snippy copied to clipboard

snippy-core replacing variant bases with reference in some cases but not others

Open jrober84 opened this issue 3 years ago • 1 comments

So this one is a bit of a weird and long one. I am using snippy v. 4.6.0 installed through conda in a centos based system. The datasets I am using here are quite large so if you want them I can package them up and send them separately. I am performing a population level SNP analysis for a low diversity Salmonella serotype with about 4800 samples. As part of this analysis I want to identify duplicate and highly similar sequences for removal to establish the population structure. I have mapped all of the samples against a common reference and called variants using snippy v. 4.6.0 and then ran snippy-core to produce the core alignment. Since there are poor quality sequences in this set I end up with a cor of 0 which is fine for my purposes since I want to run snp-sites on core.full.aln without the -c parameter to identify sequence which are disruptive by looking at how many times a given sequence introduces a gap in an otherwise good position. I would then remove the worst offenders, rinse and repeat. I noticed an issue that 2500 of my samples formed a 0 snp cluster with no variability between them.

So I began testing some subsets of the data and noticed new positions pop up as variant in the vcf files produced by snp-sites which were not in the original set. To generate a subset, I would run snippy-core on the specific samples I was interested in. I use the same --ref and --mask parameters between runs, the only difference is the samples included. It would make sense to loose variants but not to gain them since snp-sites should report all variant sites regardless of how many sequences were missing a base, as long as there was variation at that site. I tested snp-sites with a synthetic reference sequence which had all of the bases swapped A=>T, T=>A, C=> G, G=> C to produce a sequence where 100% of the sites were variable compared to the reference. When I add this synthetic sequence to the ~4800 sample snippy-core core.full.aln, I indeed get a vcf with all positions as variant. So snp-sites is behaving as expected

I ran snippy-core on just the set of samples which formed a 0 snp cluster with the reference sequence. I looked at the variant positions in my smaller subset and tested one position which was variant in the subset but not in the 4800 snippy-core. In the 4800 set it has a reference base T and in the subset snippy-core it has a base C for position 598. So the snippy-core produced on the complete set was actually different from the subset one.

I then tested that one sample that had the non-reference base and put it with the other sequences which did not form a 0 snp cluster with the reference. When I ran snippy-core in this case, then the sample had the reference base instead of the variant one.

I whittled down the comparison to two samples from the sra. When SRR1996191 is run with SRR5574766 in snippy-core, it causes SRR5574766 to have the reference base instead of its variant base at position 598 of NZ_CP012921.

snippy-core --ref NZ_CP012921. fasta --mask=Heidelberg.reference.mask.bed SRR1996191/ SRR5574766/

-- The mask file starts at position 57813 so I wouldn't expect it to be the cause of the issue and I tested not including the mask file and it didn't affect the results.

I looked at the individual vcf for SRR5574766 NZ_CP012921.1 598 . G T 1845.38 . AB=0;ABP=0;AC=2;AF=1;AN=2;AO=60;CIGAR=1X;DP=60;DPB=60;DPRA=0;EPP=5.32654;EPPR=0;GTI=0;LEN=1;MEANALT=1;MQM=60;MQMR=0;NS=1;NUMALT=1;ODDS=87.7828;PAIRED=0.833333;PAIREDR=0;PAO=0;PQA=0;PQR=0;PRO=0;QA=2118;QR=0;RO=0;RPL=34;RPP=5.32654;RPPR=0;RPR=26;RUN=1;SAF=16;SAP=31.3842;SAR=44;SRF=0;SRP=0;SRR=0;TYPE=snp GT:DP:AD:RO:QR:AO:QA:GL 1/1:60:0,60:0:0:60:2118:-190.86,-18.0618,0

and SRR1996191 there is nothing in the vcf until position 57605

When I look at the aligned.fa for SRR1996191 there is reference bases 565 and 566 but then N's until 603. So I am not sure why snippy-core is changing the variant base call for sequences to the reference base when it is run with this sequence.

This is only a single example but there were thousands of bases affected in the original set.

jrober84 avatar Jul 16 '20 23:07 jrober84

So I dug into this a bit more and the issue seems to be that snippy-core skips processing sequences as soon as it encounters 1 sequence which has an ambiguous base at a given position.

The issue is at line 273 in snippy-core next POS unless $ac =~ m/[AGTC]/i;. When this is commented out then core.full.aln has the expected set of snps for each sample. This issue would only affect those who want to get variants which are outside the 100% core sequences so this issue will not likely apply to most people using snippy-core. This is not true fix to the issue since core.vcf is now populated with positions not 100% present in the sample set

jrober84 avatar Jul 17 '20 14:07 jrober84