bcftools icon indicating copy to clipboard operation
bcftools copied to clipboard

Make --indel-bias more sensitive to indels

Open pd3 opened this issue 2 years ago • 4 comments

by switching off an existing heuristics which reduces indelQ and, in the extreme case, has the power to discard an indel entirely when the reference alignment has low score.

This is a problem for long reads, so newly --indel-bias 1.01 is added to 'ont' and 'pacbio-ccs' presets.

Tentative fix to #1459

pd3 avatar Jan 28 '22 08:01 pd3

Unfortunately repurposing an existing option is trashing the intention of the original option.

(Edit: updated with entirety of chr1.)

Chr1 on SynDip with --indel-bias 1.0 (the default) gives:

SNP          Q>0 /   Q>=30 / Filtered
SNP   TP  265363 /  264144 /  264062
SNP   FP    5666 /    4052 /    3341
SNP   GT     676 /     565 /     529
SNP   FN    4292 /    5511 /    5593

InDel TP   38284 /   37737 /   37611
InDel FP    1669 /    1267 /    1172
InDel GT    1632 /    1555 /    1522
InDel FN    7752 /    8299 /    8425

With --indel-bias 1.1, to favour higher recall rate at the expense of lower precision gives:

InDel TP   38349 /   37832 /   37702
InDel FP    1723 /    1309 /    1207
InDel GT    1626 /    1551 /    1517
InDel FN    7687 /    8204 /    8334

So we see at filtered Q30 we have +35 FP and -91 FN (plus 5 fewer with the incorrect genotype assignment). It's a small shift, but it was only a small tweak to the bias value too. It's quite a favourable FP vs FN tradeoff too if you prefer more recall.

Pushing it further, to --indel-bias 4.0 we see:

InDel TP   38669 /   38285 /   38145
InDel FP    2237 /    1740 /    1591
InDel GT    1637 /    1591 /    1560
InDel FN    7367 /    7751 /    7891

That's a substantial shift. Vs the default mode it's +419FP vs -534FN. Closer to parity between the +/-.

With this PR and --indel-bias 1.1 we get:

InDel TP   38758 /   38427 /   38287
InDel FP    2614 /    2026 /    1860
InDel GT    1663 /    1622 /    1590
InDel FN    7278 /    7609 /    7749

Vs the baseline that's now +654FP and -676FN. It may be beneficial depending on what people most highly prioritise, but it's removed a dial from peoples control and has gone all-in basically.

The evidence here is that we need to evaluate the impact of such changes. At the very least, this ought to be a new option so the ability to tune for FP vs FN priority is retained. I suspect the (somewhat hacky) heuristic needs some major tuning. I don't know what it's keying into and why it works for short read data and not long read, but it's likely overly sensitive to either error rates or read lengths, so the long term solution is probably to rewrite it. (I'm OK with a short term fix to optionally simply skip this line of code though.)

jkbonfield avatar Jan 28 '22 09:01 jkbonfield

This is now restructured into a separate option --no-indelQ-tweaks. Sadly, it does not help always, still investigating..

pd3 avatar Feb 08 '22 14:02 pd3

My big concern with this is whether it'll actually last more than a few months as a half-solution.

It's not working right now on all the data files, so we need to find a better fix, which I'm researching. It's quite possible that an improved consensus construction algorithm in the bam2bcf_indel code may make this option completely redundant.

That work isn't going to appear in our next release, so if this option worked more robustly then I'd be inclined to add it anyway. However the fact it seems somewhere around a 50/50 success rate at rescuing these missing indels, while also having a major jump to false positives, makes me wonder if we should simply reject this in favour of a better implementation down the road.

jkbonfield avatar Feb 11 '22 09:02 jkbonfield

In the light of all recent tests I agree, this is not a viable solution.

pd3 avatar Feb 11 '22 09:02 pd3