bcftools icon indicating copy to clipboard operation
bcftools copied to clipboard

bcftools v1.17 no longer finds known variants from v1.8

Open fd-bnt opened this issue 2 years ago • 14 comments

Hi all,

I inherited a setup that uses bcftools 1.8 for variant calling and decided to update to v1.17, but can no longer reproduce the (known) INDEL calls for some samples no matter how much i relax indel calling parameters.

The test experiment we set up was NGS sequencing of a plasmid library (thus the coverages in the thousands) with different concentrations (1%, 5%, 10%) of spike ins of a slightly modified plasmid containing 3 known INDEL variants at positions 115, 138 and 148.

In the original workflow we could detect all 3 positions using bcftools v1.8 at approximately the correct frequencies. Now after switching bcftools to version 1.17 I can no longer detect any of the 3 INDELs in the 1% or 5% samples while the 10% sample produces the same output for version 1.8 and 1.17.

I tried the experimental –indels-2.0 but unfortunately it did not finish after 6h so we tried to relax all other parameters using:

# Parameters for bcftools mpileup
-a INFO/AD,INFO/ADF,INFO/ADR --max-depth 200000 --max-idepth 200000 -Q 0 --indel-bias 2.0 --open-prob 20

# Parameters for bcftools call
--ploidy 1 -A -m -P 0

For 1.8 and a 1.17 “normal” run I used the same values, except –indel-bias and –open-prob were left as defaults.

I added the variant call results below. Why can I no longer detect the INDELS in the new version? Is this an expected result?

Results of bcftools 1.8 vs 1.17

V1.17 5% not finding any INDELs at expected positions 115, 138, 148:

             name  pos id ref   alt     qual  filter                                                                                                                                                                                                            info format            data q-cutoff version library spike-in
Q0_1.17_L14_relax  115  .   A C,G,T 0.669821 lowQual    DP=12509;ADF=10439,32,5,7;ADR=2025,0,1,0;AD=12464,32,6,7;VDB=4.60008e-05;SGB=-0.693147;RPBZ=5.81608;MQBZ=-0.254136;MQSBZ=0.034046;BQBZ=-7.58792;SCBZ=0.754134;MQ0F=0;AC=0,0,0;AN=1;DP4=10439,2025,44,1;MQ=59  GT:PL 0:0,255,255,255       Q0    1.17     L14       5%
Q0_1.17_L14_relax  138  .   G A,T,C 0.670060 lowQual DP=13128;ADF=10675,21,18,22;ADR=2391,0,1,0;AD=13066,21,19,22;VDB=0.927333;SGB=-0.693147;RPBZ=5.84939;MQBZ=0.00208167;MQSBZ=0.0142642;BQBZ=-9.31408;SCBZ=-0.65444;MQ0F=0;AC=0,0,0;AN=1;DP4=10675,2391,61,1;MQ=59  GT:PL 0:0,255,255,255       Q0    1.17     L14       5%
Q0_1.17_L14_relax  148  .   C   T,A 1.249830 lowQual          DP=13118;ADF=10387,34,0;ADR=2692,4,1;AD=13079,38,1;VDB=0.00119279;SGB=-0.693144;RPBZ=8.71266;MQBZ=0.00165207;MQSBZ=0.0153911;BQBZ=-8.53877;SCBZ=-0.186703;MQ0F=0;AC=0,0;AN=1;DP4=10387,2692,34,5;MQ=59  GT:PL     0:0,255,255       Q0    1.17     L14       5%

V1.8 5%

      name  pos id  ref   alt     qual  filter                                                                                                                                                                                             info format            data q-cutoff version library spike-in
Q0_1.8_L14  115  .    A C,G,T 0.669819 lowQual DP=12509;ADF=10439,32,5,7;ADR=2025,0,1,0;AD=12464,32,6,7;VDB=6.93008e-05;SGB=-0.693147;RPB=4.45644e-08;MQB=0.968236;MQSB=0.999421;BQB=1.54514e-13;MQ0F=0;AC=0,0,0;AN=1;DP4=10439,2025,44,1;MQ=59  GT:PL 0:0,255,255,255       Q0     1.8     L14       5%
Q0_1.8_L14  115  .   AC     A 3.036510    PASS                                  INDEL;IDV=591;IMF=0.0472422;DP=12510;ADF=9941,543;ADR=1938,88;AD=11879,631;VDB=0.504203;SGB=-0.693147;MQSB=0.999421;MQ0F=0;AC=0;AN=1;DP4=9941,1938,543,88;MQ=59  GT:PL         0:0,255       Q0     1.8     L14       5%
Q0_1.8_L14  138  .    G A,T,C 0.670058 lowQual DP=13128;ADF=10675,21,18,22;ADR=2391,0,1,0;AD=13066,21,19,22;VDB=0.93761;SGB=-0.693147;RPB=3.62158e-08;MQB=0.999998;MQSB=0.999898;BQB=1.03409e-19;MQ0F=0;AC=0,0,0;AN=1;DP4=10675,2391,61,1;MQ=59  GT:PL 0:0,255,255,255       Q0     1.8     L14       5%
Q0_1.8_L14  138  .   GC     G 3.025680    PASS                               INDEL;IDV=632;IMF=0.0481377;DP=13129;ADF=10164,573;ADR=2288,104;AD=12452,677;VDB=0.19033;SGB=-0.693147;MQSB=0.999898;MQ0F=0;AC=0;AN=1;DP4=10164,2288,573,104;MQ=59  GT:PL         0:0,255       Q0     1.8     L14       5%
Q0_1.8_L14  148  .    C   T,A 1.249830 lowQual           DP=13118;ADF=10387,34,0;ADR=2692,4,1;AD=13079,38,1;VDB=0.00155996;SGB=-0.693144;RPB=3.66195e-17;MQB=0.999999;MQSB=0.999882;BQB=1.2957e-16;MQ0F=0;AC=0,0;AN=1;DP4=10387,2692,34,5;MQ=59  GT:PL     0:0,255,255       Q0     1.8     L14       5%
Q0_1.8_L14  148  . CTTT   CTT 3.039690    PASS                               INDEL;IDV=655;IMF=0.0499314;DP=13118;ADF=9783,638;ADR=2548,149;AD=12331,787;VDB=0.0882275;SGB=-0.693147;MQSB=0.999882;MQ0F=0;AC=0;AN=1;DP4=9783,2548,638,149;MQ=59  GT:PL         0:0,255       Q0     1.8     L14       5%

V1.17 10% finding all 3 INDELS

             name  pos id  ref   alt     qual  filter                                                                                                                                                                                                                                 info format            data q-cutoff version library spike-in
Q0_1.17_L15_relax  115  .    A C,T,G 0.669860 lowQual                          DP=16371;ADF=13557,48,9,5;ADR=2751,0,1,0;AD=16308,48,10,5;VDB=0.000196249;SGB=-0.693147;RPBZ=6.05133;MQBZ=-0.431482;MQSBZ=-0.102198;BQBZ=-8.76334;SCBZ=0.873;MQ0F=0;AC=0,0,0;AN=1;DP4=13557,2751,62,1;MQ=59  GT:PL 0:0,255,255,255       Q0    1.17     L15      10%
Q0_1.17_L15_relax  115  .   AC     A 3.297230    PASS    INDEL;IDV=1493;IMF=0.0911589;DP=16378;ADF=12397,1229;ADR=2488,264;AD=14885,1493;VDB=0.101671;SGB=-0.693147;RPBZ=0.728506;MQBZ=0.029997;MQSBZ=-0.102198;BQBZ=-8.76334;SCBZ=-1.34247;MQ0F=0;AC=0;AN=1;DP4=12397,2488,1229,264;MQ=59  GT:PL         0:0,152       Q0    1.17     L15      10%
Q0_1.17_L15_relax  138  .    G A,T,C 0.669957 lowQual                       DP=17241;ADF=13907,26,22,33;ADR=3253,0,0,0;AD=17160,26,22,33;VDB=0.457625;SGB=-0.693147;RPBZ=7.28345;MQBZ=-0.186605;MQSBZ=-0.0565974;BQBZ=-11.803;SCBZ=0.339357;MQ0F=0;AC=0,0,0;AN=1;DP4=13907,3253,81,0;MQ=59  GT:PL 0:0,255,255,255       Q0    1.17     L15      10%
Q0_1.17_L15_relax  138  .   GC     G 3.289400    PASS INDEL;IDV=1649;IMF=0.0956275;DP=17244;ADF=12664,1327;ADR=2931,322;AD=15595,1649;VDB=0.246366;SGB=-0.693147;RPBZ=-2.14384;MQBZ=0.0300275;MQSBZ=-0.0565974;BQBZ=-11.803;SCBZ=-0.0819827;MQ0F=0;AC=0;AN=1;DP4=12664,2931,1327,322;MQ=59  GT:PL         0:0,167       Q0    1.17     L15      10%
Q0_1.17_L15_relax  148  .    C T,A,G 0.669713 lowQual                      DP=17330;ADF=13655,44,4,4;ADR=3621,2,0,0;AD=17276,46,4,4;VDB=0.0780127;SGB=-0.693147;RPBZ=8.97576;MQBZ=0.00735484;MQSBZ=-0.0617568;BQBZ=-10.2599;SCBZ=-0.0513985;MQ0F=0;AC=0,0,0;AN=1;DP4=13655,3621,52,2;MQ=59  GT:PL 0:0,255,255,255       Q0    1.17     L15      10%
Q0_1.17_L15_relax  148  . CTTT   CTT 3.279090    PASS  INDEL;IDV=1650;IMF=0.0952051;DP=17331;ADF=12402,1306;ADR=3279,344;AD=15681,1650;VDB=0.111354;SGB=-0.693147;RPBZ=-1.79498;MQBZ=0.0426755;MQSBZ=-0.0617568;BQBZ=-10.2599;SCBZ=0.799944;MQ0F=0;AC=0;AN=1;DP4=12402,3279,1306,344;MQ=59  GT:PL         0:0,173       Q0    1.17     L15      10%

V1.8 10%

      name  pos id  ref   alt     qual  filter                                                                                                                                                                                              info format            data q-cutoff version library spike-in
Q0_1.8_L15  115  .    A C,T,G 0.669850 lowQual  DP=16371;ADF=13557,48,9,5;ADR=2751,0,1,0;AD=16308,48,10,5;VDB=0.000271031;SGB=-0.693147;RPB=1.23442e-08;MQB=0.911134;MQSB=0.994793;BQB=6.4333e-19;MQ0F=0;AC=0,0,0;AN=1;DP4=13557,2751,62,1;MQ=59  GT:PL 0:0,255,255,255       Q0     1.8     L15      10%
Q0_1.8_L15  115  .   AC     A 3.066710    PASS                           INDEL;IDV=1493;IMF=0.0911589;DP=16378;ADF=12307,1319;ADR=2476,276;AD=14783,1595;VDB=0.655657;SGB=-0.693147;MQSB=0.994791;MQ0F=0;AC=0;AN=1;DP4=12307,2476,1319,276;MQ=59  GT:PL         0:0,255       Q0     1.8     L15      10%
Q0_1.8_L15  138  .    G A,T,C 0.669956 lowQual DP=17241;ADF=13907,26,22,33;ADR=3253,0,0,0;AD=17160,26,22,33;VDB=0.491096;SGB=-0.693147;RPB=3.08703e-12;MQB=0.982737;MQSB=0.998399;BQB=3.60759e-31;MQ0F=0;AC=0,0,0;AN=1;DP4=13907,3253,81,0;MQ=59  GT:PL 0:0,255,255,255       Q0     1.8     L15      10%
Q0_1.8_L15  138  .   GC     G 3.041500    PASS                           INDEL;IDV=1649;IMF=0.0956275;DP=17244;ADF=12608,1383;ADR=2926,327;AD=15534,1710;VDB=0.307998;SGB=-0.693147;MQSB=0.998399;MQ0F=0;AC=0;AN=1;DP4=12608,2926,1383,327;MQ=59  GT:PL         0:0,255       Q0     1.8     L15      10%
Q0_1.8_L15  148  .    C T,A,G 0.669694 lowQual    DP=17330;ADF=13655,44,4,4;ADR=3621,2,0,0;AD=17276,46,4,4;VDB=0.0996844;SGB=-0.693147;RPB=3.41775e-18;MQB=0.999973;MQSB=0.998094;BQB=2.92277e-24;MQ0F=0;AC=0,0,0;AN=1;DP4=13655,3621,52,2;MQ=59  GT:PL 0:0,255,255,255       Q0     1.8     L15      10%
Q0_1.8_L15  148  . CTTT   CTT 3.064180    PASS                           INDEL;IDV=1650;IMF=0.0952051;DP=17331;ADF=12215,1493;ADR=3222,401;AD=15437,1894;VDB=0.184847;SGB=-0.693147;MQSB=0.998094;MQ0F=0;AC=0;AN=1;DP4=12215,3222,1493,401;MQ=59  GT:PL         0:0,255       Q0     1.8     L15      10%

fd-bnt avatar May 30 '23 07:05 fd-bnt

Is there any chance you could provide a small test case? I am unable to comment without seeing the data unfortunately.

pd3 avatar May 30 '23 08:05 pd3

I produced some test files by starting with a bit of covid-19 data and replicating up one record that had an errant 1bp deletion.

https://github.com/samtools/bcftools/pull/1824 broke the call. In theory this was meant to be guarded behind --indels-2.0, but it also changed the normal behaviour. Ie try builds from (currently) git checkout develop~49 vs git checkout develop~48.

My test data at Sanger is seq4d:/local/scratch01/jkb/3000.sam and 1000.sam. 3000.sam is about 14% af while 3000 is 33%. Oddly I can't get the 1000.sam to call with 1.8 either, but maybe I'm not using the correct options.,

jkbonfield avatar May 30 '23 09:05 jkbonfield

Is there any chance you could provide a small test case? I am unable to comment without seeing the data unfortunately.

Unfortunately I can not share the exact sequences. I may be able to generate an artificial dataset with similar characteristics for testing purposes.

fd-bnt avatar May 30 '23 09:05 fd-bnt

For #1824 bug, we can already reproduce it locally.

# on seq4d
cd /local/scratch01/jkb
bcftools mpileup -B -L 99999 -d 999999 --fasta-ref /nfs/srpipe_references/references/SARS-CoV-2/MN908947.3/all/fasta/MN908947.3.fa -Q 1 -a AD 3000.sam 2>/dev/null |bcftools call -vm - 2>/dev/null|grep -v '^#'

MN908947.3	78	.	TAAAA	TAAA	44.2646	.	INDEL;IDV=3001;IMF=0.330252;DP=9087;VDB=0;SGB=-0.693147;RPBZ=-0.510721;MQBZ=0;BQBZ=-2.5731;NMBZ=-22.2016;SCBZ=-0.612311;FS=0;MQ0F=0;AC=1;AN=2;DP4=6083,0,3001,0;MQ=60	GT:PL:AD	0/1:77,0,59:6083,3001

That's with b5cbcd5d. One commit later, 9d948cf3, no longer finds this. Comparing the two lines of mpileup (ie not call) I see:

#fails
MN908947.3	78	.	TAAAA	TAAA	0	.	INDEL;IDV=3001;IMF=0.330252;DP=9087;I16=6086,0,3001,0,243440,9.7376e+06,120040,4.8016e+06,365160,2.19096e+07,180060,1.08036e+07,152150,3.80375e+06,75025,1.87562e+06;QS=0.589033,0.410967;VDB=0;SGB=-0.693147;RPBZ=-0.510721;MQBZ=0;BQBZ=-2.5731;NMBZ=-22.2308;SCBZ=-0.612311;FS=0;MQ0F=0	PL:AD	24,0,27:6086,3001

#works
MN908947.3	78	.	TAAAA	TAAA	0	.	INDEL;IDV=3001;IMF=0.330252;DP=9087;I16=6083,0,3001,0,243320,9.7328e+06,120040,4.8016e+06,364980,2.18988e+07,180060,1.08036e+07,152075,3.80188e+06,75025,1.87562e+06;QS=0.588985,0.411015;VDB=0;SGB=-0.693147;RPBZ=-0.510721;MQBZ=0;BQBZ=-2.5731;NMBZ=-22.2016;SCBZ=-0.612311;FS=0;MQ0F=0	PL:AD	77,0,59:6083,3001

These are almost identical, so I'm not sure what broke. It's possibly the call side of things.

Neither can cope with the 1000.sam file, which shows this from bcftools mpileup:

MN908947.3	78	.	TAAAA	TAAA	0	.	INDEL;IDV=1001;IMF=0.141245;DP=7087;I16=6083,0,1001,0,243320,9.7328e+06,40040,1.6016e+06,364980,2.18988e+07,60060,3.6036e+06,152075,3.80188e+06,25025,625625;QS=0.811183,0.188817;VDB=0;SGB=-0.693147;RPBZ=-0.333316;MQBZ=0;BQBZ=-2.56641;NMBZ=-14.4851;SCBZ=-0.400981;FS=0;MQ0F=0	PL:AD	0,235,67:6083,1001

I don't have time to delve into this, and I expect @pd3 has a much better understanding of what changed in that large merge.

jkbonfield avatar May 30 '23 10:05 jkbonfield

Ah yes I can confirm using the older mpileup piped into the newer call fails, but piped into older call works. That was my test data anyway. That imlies the change we need to investigate is on the call side and not pileup side.

Can you confirm that you see the same behaviour?

jkbonfield avatar May 30 '23 11:05 jkbonfield

Unfortunately I can not share the exact sequences. I may be able to generate an artificial dataset with similar characteristics for testing purposes.

@fd-bnt Could you perhaps create an anonymized test with https://github.com/pd3/mpileup-tests/blob/main/misc/create-bam-test?

pd3 avatar May 30 '23 11:05 pd3

I tried the experimental –indels-2.0 but unfortunately it did not finish after 6h

I just pushed a commit that will likely fix the issue of --indels-2.0 never finishing.

pd3 avatar May 31 '23 13:05 pd3

# on seq4d
cd /local/scratch01/jkb
bcftools mpileup -B -L 99999 -d 999999 --fasta-ref /nfs/srpipe_references/references/SARS-CoV-2/MN908947.3/all/fasta/MN908947.3.fa -Q 1 -a AD 3000.sam 2>/dev/null |bcftools call -vm - 2>/dev/null|grep -v '^#'

MN908947.3	78	.	TAAAA	TAAA	44.2646	.	INDEL;IDV=3001;IMF=0.330252;DP=9087;VDB=0;SGB=-0.693147;RPBZ=-0.510721;MQBZ=0;BQBZ=-2.5731;NMBZ=-22.2016;SCBZ=-0.612311;FS=0;MQ0F=0;AC=1;AN=2;DP4=6083,0,3001,0;MQ=60	GT:PL:AD	0/1:77,0,59:6083,3001

I can see the likelihoods changed significantly in between these versions, that should not happen. This is a good test case, thanks for that.

pd3 avatar May 31 '23 14:05 pd3

What I found thus far:

  1. The --indels-2.0 calls the indel variant okay, so likely there is a good chance it will help, @fd-bnt.

  2. Regarding the difference between https://github.com/samtools/bcftools/commit/b5cbcd5d047a8478ac892d8fec4a5c4ffc4e0a48 and https://github.com/samtools/bcftools/commit/9d948cf3a77b5e13028d87c9c5bf2c80012af7a5, it comes down to just three reads

A00971:186:H75HJDRXY:1:2258:24361:29277
A00971:186:H75HJDRXY:1:2227:4670:21950 
A00971:186:H75HJDRXY:1:2158:3531:14465

which are included by the new version but not by the old version.

The rest seems to be a consequence of random selection of 255 or so reads in errmod_cal, where different reads happen to be sampled, I believe (to be verified). In the end this leads to a ~4000x fold change in PL estimate for the heterozygous genotype. Needless to say, this is incredibly fragile, one certainly expects similar answer given the deep coverage data.

pd3 avatar Jun 01 '23 13:06 pd3

Hey,

So i can confirm a newly built bcftools is finishing with --indels-2.0 when run on my original data sets. Though the INDEL calling is still missing the INDELs detected by version 1.8.

Using mpileup -a INFO/AD,INFO/ADF,INFO/ADR --max-depth 200000 --max-idepth 200000 -Q 0 --indels-2.0 and call --ploidy 1 -A -m -P 0 for version 1.17-52-g0773541c+htslib-1.17-29-g878cff4 i get:

test_name       115     .       A       C,G,T   0.669821        lowQual DP=12509;ADF=10439,32,5,7;ADR=2025,0,1,0;AD=12464,32,6,7;VDB=4.60008e-05;SGB=-0.693147;RPBZ=5.81608;MQBZ=-0.254136;MQSBZ=0.034046;BQBZ=-7.58792;SCBZ=0.754134;MQ0F=0;AC=0,0,0;AN=1;DP4=10439,2025,44,1;MQ=59    GT:PL   0:0,255,255,255
test_name       138     .       G       A,T,C   0.67006         lowQual DP=13128;ADF=10675,21,18,22;ADR=2391,0,1,0;AD=13066,21,19,22;VDB=0.927333;SGB=-0.693147;RPBZ=5.84939;MQBZ=0.00208167;MQSBZ=0.0142642;BQBZ=-9.31408;SCBZ=-0.65444;MQ0F=0;AC=0,0,0;AN=1;DP4=10675,2391,61,1;MQ=59 GT:PL   0:0,255,255,255
test_name       148     .       C       T,A     1.24983         lowQual DP=13118;ADF=10387,34,0;ADR=2692,4,1;AD=13079,38,1;VDB=0.00119279;SGB=-0.693144;RPBZ=8.71266;MQBZ=0.00165207;MQSBZ=0.0153911;BQBZ=-8.53877;SCBZ=-0.186703;MQ0F=0;AC=0,0;AN=1;DP4=10387,2692,34,5;MQ=59  GT:PL   0:0,255,255

I am trying to create an artificial data set I can share to re-create the behaviour, though it's not something i have done before so I am trying to figure out how to best produce this. If you have any best practice or similar for this it would be great.

fd-bnt avatar Jun 01 '23 16:06 fd-bnt

I'd suggest to use https://github.com/pd3/mpileup-tests/blob/main/misc/create-bam-test. It creates a slice of the bam file and the reference genome, so the new coordinates are completely nonsensical. It also renames the sample and read group.

This could go a bit further and also randomly shuffle A,C,G,T bases, but of course even that cannot prevent a motivated person from finding the source genome and the position when the sequence is unique enough.

Also read names could be removed.

Yet another level, you could share the data with us offline, we'll delete them after. My email address is on my github profile page.

Let me know if any of these enhancements would make sharing the data possible.

pd3 avatar Jun 05 '23 09:06 pd3

I just added a new option -a, --anonymize to https://github.com/pd3/mpileup-tests/blob/main/misc/create-bam-test, which should add a sufficient level of anonymization for most data.

pd3 avatar Jun 06 '23 09:06 pd3

I think I managed to create an artificial dataset that reproduces the behaviour as far as i can tell. I took the original mapping and replaced reference sequence and alignments read sequence in the sam-file column 10 according to the cigar string. The reads should now contain the deletions at the exact same place and frequency as the original dataset, while having a 100% random (reference)sequence. Though any mutations are lost, but for the current problem i think that is irrelevant.

I have attached the reference and mapping file as well as the results for mpileup v1.8 and v1.17. synthetic_sequences.zip synth_mpileup_results.zip

Produced with v1.8: bcftools mpileup -a INFO/AD,INFO/ADF,INFO/ADR --max-depth 200000 --max-idepth 200000 -Q 0 -f synth.fa -o synth.bcf8.mpileup.vcf synth.sam (shows INDELs)

and v1.17: bcftools mpileup -a INFO/AD,INFO/ADF,INFO/ADR --max-depth 200000 --max-idepth 200000 -Q 0 --indels-2.0 -f synth.fa -o synth.mpileup.vcf synth.sam (missing INDELs)

I hope this is reproducible.

fd-bnt avatar Jun 13 '23 09:06 fd-bnt

@fd-bnt Thank you, got it and can confirm that it is reproducible. The difference happened between 1.12 and 1.13.

pd3 avatar Jun 15 '23 09:06 pd3