bcftools
bcftools copied to clipboard
Make --indel-bias more sensitive to indels
by switching off an existing heuristics which reduces indelQ and, in the extreme case, has the power to discard an indel entirely when the reference alignment has low score.
This is a problem for long reads, so newly --indel-bias 1.01
is added to 'ont' and 'pacbio-ccs' presets.
Tentative fix to #1459
Unfortunately repurposing an existing option is trashing the intention of the original option.
(Edit: updated with entirety of chr1.)
Chr1 on SynDip with --indel-bias 1.0
(the default) gives:
SNP Q>0 / Q>=30 / Filtered
SNP TP 265363 / 264144 / 264062
SNP FP 5666 / 4052 / 3341
SNP GT 676 / 565 / 529
SNP FN 4292 / 5511 / 5593
InDel TP 38284 / 37737 / 37611
InDel FP 1669 / 1267 / 1172
InDel GT 1632 / 1555 / 1522
InDel FN 7752 / 8299 / 8425
With --indel-bias 1.1
, to favour higher recall rate at the expense of lower precision gives:
InDel TP 38349 / 37832 / 37702
InDel FP 1723 / 1309 / 1207
InDel GT 1626 / 1551 / 1517
InDel FN 7687 / 8204 / 8334
So we see at filtered Q30 we have +35 FP and -91 FN (plus 5 fewer with the incorrect genotype assignment). It's a small shift, but it was only a small tweak to the bias value too. It's quite a favourable FP vs FN tradeoff too if you prefer more recall.
Pushing it further, to --indel-bias 4.0
we see:
InDel TP 38669 / 38285 / 38145
InDel FP 2237 / 1740 / 1591
InDel GT 1637 / 1591 / 1560
InDel FN 7367 / 7751 / 7891
That's a substantial shift. Vs the default mode it's +419FP vs -534FN. Closer to parity between the +/-.
With this PR and --indel-bias 1.1
we get:
InDel TP 38758 / 38427 / 38287
InDel FP 2614 / 2026 / 1860
InDel GT 1663 / 1622 / 1590
InDel FN 7278 / 7609 / 7749
Vs the baseline that's now +654FP and -676FN. It may be beneficial depending on what people most highly prioritise, but it's removed a dial from peoples control and has gone all-in basically.
The evidence here is that we need to evaluate the impact of such changes. At the very least, this ought to be a new option so the ability to tune for FP vs FN priority is retained. I suspect the (somewhat hacky) heuristic needs some major tuning. I don't know what it's keying into and why it works for short read data and not long read, but it's likely overly sensitive to either error rates or read lengths, so the long term solution is probably to rewrite it. (I'm OK with a short term fix to optionally simply skip this line of code though.)
This is now restructured into a separate option --no-indelQ-tweaks
. Sadly, it does not help always, still investigating..
My big concern with this is whether it'll actually last more than a few months as a half-solution.
It's not working right now on all the data files, so we need to find a better fix, which I'm researching. It's quite possible that an improved consensus construction algorithm in the bam2bcf_indel code may make this option completely redundant.
That work isn't going to appear in our next release, so if this option worked more robustly then I'd be inclined to add it anyway. However the fact it seems somewhere around a 50/50 success rate at rescuing these missing indels, while also having a major jump to false positives, makes me wonder if we should simply reject this in favour of a better implementation down the road.
In the light of all recent tests I agree, this is not a viable solution.