cortex
cortex copied to clipboard
Fix known bugs in genotyping and look at confidence distribution
To fix:
- poisson error rate uses estimates of total read count - prefer to use median depth
- improve gt confidence
At this commit on branch fix_poisson:
commit 85c17ccffaeb14c691d489dbaaf11e144e310cb6 Author: Zamin Iqbal [email protected] Date: Tue Nov 17 13:51:44 2015 +0000
Add script for getting confusion matrices
Results are looking pretty interesting.
I've taken a binary of NA12878 (this is a human sample, and have used Platinum genome data, cleaning with threshold 7), and intersected with 5000 random OMNI sites. I then compare with the 1000g calls at those sites as truth
I then calculate confusion matrices of genotyping for v1.0.5.21 of Cortex and for this commit, and compare them. To clarify the signal, I produce confusion matrices for calls with genotype confidence between 0 and 9, and for those between 10 and 19, and... Here's some results. Left column is v1.0.5.21 and right is tip of fix_poisson

Notice that we now put more stuff in the low confidence bin, but by the time confidence>10, in the new commit, we lose much fewer hets, but we do call more as missing (which is bizarre). So apart from the extra missingness, I am v happy
Oh no it's OK! There are a bunch of sites which we call as missing - in the old commit they have low confidence and now they have higher. I think the missingness is likely because these OMNI sites are actually adjacent to indels. There is a separate question as to what the confidence of a ./. call is - will look into this
Merged into master here
https://github.com/iqbal-lab/cortex/commit/85c17ccffaeb14c691d489dbaaf11e144e310cb6
I just want to check the genotyping of longer alleles and then ok to close. All unit tests pass including with valgrind.
'There is a separate question as to what the confidence of a ./. call is - will look into this' Would indeed be great if we could distinguish between truly missing (deletion) and uncertainty in calling. Back to my lectures now, just had to add this.
OK, so in two parts.
Why does a ./. call have a confidence?
When Cortex genotypes a bubble, you pass it a model. That model is diploid or haploid at the moment, and allows you to be either hom-ref, het, or hom-alt (for diploid, haploid is either hom-ref or hom-alt). Now suppose one allele is longer than another. And suppose we have zero covg on both alleles. Well then, the most likely model is homozygous for the shorter allele, as more chance of having zero covg on two short alleles. So, Cortex types it thus. Then, a wrapper script converts the cortex output to a VCF, and forcibly sets genotype to ./. if there is zero coverage on both alleles, as that is what most consumers want, but leaves the confidence unchanged.
Can we distinguish truly missing from uncertainty in calling If, Henk, you mean to ask us to support the possibility, at any site, of the whole locus being deleted - I guess we could do that, but currently do not. One reason for going back over all samples and genotyping at all sites is to be able to make a call on SNPs/whatever that were not called in person-X, to see if we think they are hom-ref, or actually het/hom-alt, but had not been called previously due to low confidence.