Improvements to indel calling.
-
The bcf_call_glfgen code had a heuristic that changed a read's indel type call to 0 (REF) when the read showed no indel at that location, on the assumption that our type assessment was poor.
This may have been helpful in the past, but with better consensus calculation our type calls are now more robust. Instead we now mark such reads as unclassified, so they don't count towards AD and PL.
This is still a fudge, but overall it's beneficial: we get a 7% rise in FP, but a 2% drop in FN and a huge 16% drop in GT assignment errors.
-
Improved bcf_cgp_find_types, which determines the number of candidate indel sizes present, to consider the per_sample_flt and min_fract parameters. Previously these filters were only applied after every read had been realigned; filtering earlier reduces the number of incorrect indel assignments and improves accuracy.
It is likely also the cause of the PacBio calling speed-up, as we have a low-level background of indel errors in that data.
-
Replaced the old bcf_cgp_ref_sample code with a new bcf_cgp_consensus algorithm.
The original logic started with the real reference and walked through the read CIGAR strings, amending any SNPs. It did this once per sample. Then, for each hypothetical indel type, this "sample reference" was amended at the indel site itself by inserting or deleting bases. Each read was then aligned against this modified reference to evaluate how well it matched.
Unfortunately this is simply incorrect when there are neighbouring indels present, as only the central, under-investigation indel ever made it into the modified reference.
The new code produces up to two consensus sequences per sample, based on all of the SEQ+CIGAR elements (including indels). This copes with evaluating e.g. a homozygous insertion flanked by a heterozygous deletion (hence two consensuses). A small dose of the real reference is also blended in, to cope with low-coverage artefacts.
This is the heart of these indel calling changes and the primary reason for fixing some obvious-looking indel calls that were previously missed. However it's still not a replacement for a full de novo reassembly, and alignment errors can still lead to incorrect results, but by and large we now fail on messy data instead of occasionally failing on good data.
-
bcf_cgp_align_score can now be given two references, and we return whichever alignment score is best. The band sizing is also a little different: it now operates on a band related to all observed indel types rather than just the specific indel type being assessed, and it also takes into account the total insertions and deletions at nearby positions gathered during consensus construction.
-
Improved est_seqQ. This is a naive sequence-quality assignment based on the size of the indel vs the size of the homopolymer at that position. This is reasonable for a deletion, but for an insertion it didn't check that the inserted sequence actually matched the homopolymer base type.
TODO: homopolymer is too naive. It should be looking at indel in STR repeat unit terms.
-
Improved the indel window-size adjustment for long reads. Shrinking the window worked because long reads rarely have alignment end effects, where the aligner has produced reference bias by soft-clipping a few trailing bases. However, the window is now resized more accurately, based on the total proportion of bases within it covered by STRs. On average this is smaller than the old "indel_win_size/2" method, so it further speeds up alignment of long-read data.
-
Lots of additional commenting. Some explaining how things work, and a few with "TODO" ideas for potential improvements to explore.
-
The various bits of commented out debugging code are now more cleanly wrapped up in a CONS_DEBUG define, to ease enabling and disabling.
-
Fixed a few minor memory leaks on failure. These likely never occurred, or if they did were immediately followed by program exit anyway.
-
Reduced the default indel_win_size from 110 to 80. With better consensus construction, a smaller value was found to work better. There is also no longer a hard lower limit of 110; it can now be overridden down to 20.
-
Improved the find_STR function to spot repeat units of up to 14 bases; it was previously limited to 8.
Please advise on whether you wish this squashed. It'll be a huge change then, but equally it's not that useful to look at all the commits individually, as they contain ideas which were subsequently removed again. Squashing just some of it, and in particular reordering to group things together, will be problematic, as it'll lead to a myriad of conflicts from adding and removing exploratory code.
Replaces https://github.com/samtools/bcftools/pull/1648
Some stats on GIAB HG002 chr1 truth set. Showing percentages for all data, QUAL>=30 and QUAL>=30 with additional DP filtering.
15x dev (only 1-169MB as lacking in dev)
InDel TP 94.07 / 90.28 / 88.21
InDel FP 1.47 / 1.04 / 0.73
InDel GT 2.39 / 2.20 / 2.11
InDel FN 5.93 / 9.72 / 11.79
15x new (1-169MB)
InDel TP 94.48 / 91.35 / 89.18
InDel FP 1.29 / 0.89 / 0.76 --/+
InDel GT 2.45 / 2.25 / 2.18 +/~
InDel FN 5.52 / 8.65 / 10.82 --/--
15x new (all chr)
InDel TP 94.73 / 91.64 / 89.56
InDel FP 1.26 / 0.91 / 0.79
InDel GT 2.38 / 2.19 / 2.12
InDel FN 5.27 / 8.36 / 10.44
--------------------
30x dev
InDel TP 97.94 / 96.99 / 96.80
InDel FP 1.08 / 0.83 / 0.69
InDel GT 1.04 / 0.99 / 0.98
InDel FN 2.06 / 3.01 / 3.20
30x new
InDel TP 98.11 / 97.36 / 97.18
InDel FP 0.95 / 0.77 / 0.68 -/~
InDel GT 0.97 / 0.90 / 0.89 -/-
InDel FN 1.89 / 2.64 / 2.82 -/--
--------------------
60x dev
InDel TP 98.59 / 98.35 / 98.32
InDel FP 0.84 / 0.68 / 0.62
InDel GT 0.75 / 0.74 / 0.74
InDel FN 1.41 / 1.65 / 1.68
60x new
InDel TP 98.74 / 98.58 / 98.54
InDel FP 0.80 / 0.67 / 0.61 -/~
InDel GT 0.53 / 0.51 / 0.50 --/--
InDel FN 1.26 / 1.42 / 1.46 --/--
--------------------
150x dev
InDel TP 98.80 / 98.72 / 98.70
InDel FP 0.81 / 0.72 / 0.61
InDel GT 0.58 / 0.56 / 0.56
InDel FN 1.20 / 1.28 / 1.30
150x new
InDel TP 98.88 / 98.84 / 98.80
InDel FP 0.68 / 0.62 / 0.55 --/-
InDel GT 0.35 / 0.35 / 0.35 --/--
InDel FN 1.12 / 1.16 / 1.20 -/-
30x ROC curve (FP/FN percentage rates)
Edit: culling the 15x plot as it was erroneously for all of chr1, despite missing a chunk in my run of the develop branch. Reproducing this VCF now.
Some performance figures on GIAB HG002.GRCh38.PacBio_CCS_15Kb.bam (approx 30x coverage) data:
develop:
SNP Q>0 / Q>=30 / Filtered
SNP TP 99.97 / 99.93 / 99.93
SNP FP 1.25 / 0.88 / 0.88
SNP GT 0.19 / 0.19 / 0.19
SNP FN 0.03 / 0.07 / 0.07
InDel TP 97.84 / 96.09 / 96.09
InDel FP 49.14 / 16.55 / 16.48
InDel GT 13.91 / 13.88 / 13.88
InDel FN 2.16 / 3.91 / 3.91
new:
SNP Q>0 / Q>=30 / Filtered
...
InDel TP 98.07 / 95.73 / 95.73
InDel FP 41.06 / 12.42 / 12.36
InDel GT 12.58 / 12.53 / 12.53
InDel FN 1.93 / 4.27 / 4.27
There is a slight shift in FN vs FP figures here at the Q30 cutoff, but at a slightly reduced Q25 the new code beats develop on all 3 of the FP, GT and FN figures.
new at Q25:
InDel TP 98.07 / 96.43 / 96.43
InDel FP 41.06 / 15.59 / 15.53
InDel GT 12.58 / 12.56 / 12.56
InDel FN 1.93 / 3.57 / 3.57
In both develop and this PR, it's still very much favouring recall over precision for the CCS data. We get a much more balanced starting point with -h 200, but overall the ROC curve is poorer, so we're better off keeping the default and just setting a slightly higher QUAL threshold to filter instead (e.g. Q40 is a more even recall vs precision tradeoff).
I see my tweak to filter bcf_cgp_find_types candidates by the -m and -F parameters was incorrect: it was passing candidates if either -m or -F was satisfied, whereas the actual filtering applied post-removal requires that both the -m and -F conditions are met. It worked initially, but at some point I incorrectly changed it. Given the defaults of -m 2 -F 0.05, on low-coverage data this basically means anything is (initially) still accepted, due to the low -F value.
Tweaking this (still to be pushed) gives these plots (tweak in "PR-b"):
[EDIT: fixed scaling as I had the wrong total number of indels in my gnuplot script.]
The effect of this filtering is that we lose some recall, pretty much as expected. I'm guessing it's the case where only a single alignment has an indel according to the CIGAR pileup, but post-realignment the number of candidates reported matching that indel type is >= 2. Or possibly it's the 1/2 GT cases, where -m and -F are applied to the combined AD values of all types rather than per type?
Regardless, given these are command line parameters that can be reduced if needed, the default behaviour looks very strong as a false positive removal mechanism. I'll do a bit more testing and then push an update (plus rebase to bring in the CI testing fix).
Updated with some tweaks to improve things further (as mentioned above):
-
We erroneously had the -m and -F pre-filter options applied in bcf_cgp_find_types as requiring either to pass instead of both. Fixing this gives a significant reduction in FP rates.
-
The computation of which STRs spanned the indel being assessed was sometimes slightly out, due to using "qpos" instead of "pos". Fixing this removes a few false negatives.
-
Improved the calculation of the indel cost ("iscore") by counting the number of additional repeat units beyond the end of the sequence. The old penalty for STRs at the end of reads has been removed, meaning we regain some former FNs while still penalising (more strongly) the cases that lead to FPs.
Overall there are some small FP / FN adjustments, but the biggest improvement here is a significant drop in GT errors (~10% fewer).
PacBioCCS plot is being generated atm.
Some raw numbers, as percentages of the truth set. Columns are all (qual>0), QUAL>=30 and QUAL>=30 plus extra filters (IDV, IMF, DP).
15x dev
InDel TP 94.35 / 90.56 / 88.57
InDel FP 1.42 / 1.05 / 0.74
InDel GT 2.25 / 2.08 / 1.99
InDel FN 5.65 / 9.44 / 11.43
15x 9f10
InDel TP 93.95 / 91.10 / 89.13
InDel FP 0.95 / 0.63 / 0.52 -30%
InDel GT 2.45 / 2.29 / 2.21 +11%
InDel FN 6.05 / 8.90 / 10.87 -5%
30x dev
InDel TP 97.94 / 96.99 / 96.80
InDel FP 1.08 / 0.83 / 0.69
InDel GT 1.04 / 0.99 / 0.98
InDel FN 2.06 / 3.01 / 3.20
30x 9f10
InDel TP 98.01 / 97.31 / 97.13
InDel FP 0.87 / 0.69 / 0.61 -12%
InDel GT 0.85 / 0.80 / 0.79 -19%
InDel FN 1.99 / 2.69 / 2.87 -10%
60x dev
InDel TP 98.59 / 98.35 / 98.32
InDel FP 0.84 / 0.68 / 0.62
InDel GT 0.75 / 0.74 / 0.74
InDel FN 1.41 / 1.65 / 1.68
60x 9f10
InDel TP 98.72 / 98.57 / 98.54
InDel FP 0.76 / 0.63 / 0.58 -10%
InDel GT 0.43 / 0.42 / 0.41 -45%
InDel FN 1.28 / 1.43 / 1.46 -13%
The percentage change vs the develop branch is significant for FP and FN at all depths. GT assignment errors are up for shallow data, although in absolute numbers there are still fewer new GT errors than the FNs we've recovered, so this is largely just an issue of low-coverage true variants having a significant portion with the wrong genotype assigned. At higher depths the GT error rate really plummets.
PacBio CCS stats:
This broke 112MB into chr1 when it found a 12KB deletion (now fixed). Rerunning, but the shape of the graph is unlikely to change. There are very high false positive and false negative rates on this data. We can use e.g. -h 200 to remove many of the low-quality false positives, but it's generally best to simply filter harder instead. Most of the errors (both FN and FP) are homopolymer issues, which is simply the nature of the technology.
SynDip chr1 stats:
At 15x the recent additional commit has removed some true variants, hence the elevated FN rate, but it's the only change which is consistently better on the FN vs FP tradeoff. Frankly, all of them have terrible FN miss rates.
At 45x the new PR diverges and starts to become poorer than current develop, due to higher FP rates. However, this only occurs at QUAL >= 40 or so, and the FN rates are still so enormous as to not be practically usable. I think this is likely just the nature of the truth set, and I'm being overly pedantic in my accuracy assessment. If I recall correctly, there is a caveat in SynDip about not trusting small indels, due to the nature of the underlying PacBio assembly and homopolymers, so it may simply be plain wrong on them.
Finally, assuming my revised pacbio test runs to completion, I think I'm done with this PR.
What I haven't tested is multi-sample calling, simply because I don't know of appropriate truth sets to use (maybe a trio though?) and we don't have a written best-practice guide for it, so I don't know the normal method: do we just do it in one command, or produce gVCFs and merge? We really need to document this pipeline, but that's a job for someone who knows what they're doing. :-)
So I'll leave such validation up to you.