Add tests for genotype imputation
Description
Add toy examples for testing imputation. Results obtained by running BEAGLE 4.1 are stored for comparison.
Fixes #2802
PR Checklist:
- [x] Implement BEAGLE's interpolation-style imputation algorithm in Python.
- [x] Implement a simpler version of the BEAGLE algorithm using functionalities in _tskit (tsimpute).
- [ ] Develop toy datasets for comparing BEAGLE and tskit-based imputation.
- [x] Record BEAGLE parameters and results (genotype probabilities).
- [ ] Implement functions allowing for convenient access to the toy datasets and BEAGLE results.
- [ ] Do test runs to profile performance.
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 76.97%. Comparing base (8b1be4f) to head (24d7a49).
```
@@             Coverage Diff             @@
##             main    #2815       +/-   ##
===========================================
- Coverage   89.79%   76.97%   -12.82%
===========================================
  Files          30       30
  Lines       30399    30399
  Branches     5909     5643       -266
===========================================
- Hits        27296    23400      -3896
- Misses       1778     5614      +3836
- Partials     1325     1385        +60
```
| Flag | Coverage Δ | |
|---|---|---|
| c-tests | 86.21% <ø> (ø) | |
| lwt-tests | 80.78% <ø> (ø) | |
| python-c-tests | 67.71% <ø> (ø) | |
| python-tests | ? | |
Flags with carried forward coverage won't be shown.
See 13 files with indirect coverage changes.
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 8b1be4f...24d7a49.
I think the tricky part is getting the normalisation factors right.
So we might not get exactly the same values, but I guess we would expect the output matrices to be proportional (since the normalisation factors can be arbitrary).
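Here is a minimal sketch of such a proportionality check, assuming both matrices cover the same sites and haplotypes (the function name and tolerance are illustrative):

```python
import numpy as np

def rows_proportional(a, b, rtol=1e-3):
    """Check that two probability matrices agree up to per-row scaling.

    Each row is normalised to sum to 1 before comparison, so arbitrary
    per-site normalisation factors drop out.
    """
    a_norm = a / a.sum(axis=1, keepdims=True)
    b_norm = b / b.sum(axis=1, keepdims=True)
    return np.allclose(a_norm, b_norm, rtol=rtol)
```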
I've updated this to get the tskit code to run, @szhan, and to remove the dependence on pandas for CSV parsing (updating all the requirements.txt files would have been a pain). We can do the same thing (more or less) with numpy.
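For example, something along these lines should work (the file name and column names here are hypothetical):

```python
import numpy as np

# Hypothetical BEAGLE results file; the real path and column names differ.
data = np.genfromtxt(
    "beagle_probs.csv", delimiter=",", names=True, dtype=None, encoding="utf-8"
)
# Columns are then accessible by header name, much like a DataFrame.
probs = np.column_stack([data["prob_0"], data["prob_1"]])
```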
Here is the first stab at the comparison.
Rows = sites; columns = ref. haplotypes.
Forward probability matrix

tskit:

```
[[0.23333333 0.03333333 0.23333333 0.23333333 0.03333333 0.23333333]
 [0.24637681 0.00724638 0.24637681 0.24637681 0.00724638 0.24637681]
 [0.2384058  0.02318841 0.2384058  0.2384058  0.02318841 0.2384058 ]
 [0.15942246 0.18115508 0.15942246 0.15942246 0.18115508 0.15942246]
 [0.05073599 0.39852802 0.05073599 0.05073599 0.39852802 0.05073599]]
```

beagle:

```
[[9.9990e-01 1.0000e-04 9.9990e-01 9.9990e-01 1.0000e-04 9.9990e-01]
 [1.6665e-01 1.7000e-05 1.6665e-01 1.6665e-01 1.7000e-05 1.6665e-01]
 [1.7000e-05 1.6665e-01 1.7000e-05 1.7000e-05 1.6665e-01 1.7000e-05]
 [1.7000e-05 1.6665e-01 1.7000e-05 1.7000e-05 1.6665e-01 1.7000e-05]]
```

Backward probability matrix

tskit:

```
[[0.97527473 1.34615385 0.97527473 0.97527473 1.34615385 0.97527473]
 [0.88461538 8.84615385 0.88461538 0.88461538 8.84615385 0.88461538]
 [0.58974359 9.43589744 0.58974359 0.58974359 9.43589744 0.58974359]
 [0.38017094 2.09094017 0.38017094 0.38017094 2.09094017 0.38017094]
 [1.         1.         1.         1.         1.         1.        ]]
```

beagle:

```
[[0.166667 0.166667 0.166667 0.166667 0.166667 0.166667]
 [0.166667 0.166667 0.166667 0.166667 0.166667 0.166667]
 [0.166667 0.166667 0.166667 0.166667 0.166667 0.166667]
 [0.166667 0.166667 0.166667 0.166667 0.166667 0.166667]]
```
The forward values look pretty proportional to me. But I don't get why the values in the BEAGLE backward matrix are so uniform.
I suspect it has to do with my compiled copy of BEAGLE 4.1. BEAGLE's forward-backward algorithm only keeps track of the values for one site/marker at a time in order to reduce memory requirements. For some reason, I'm not seeing them being updated across the iterations.
Also, I'm seeing that BEAGLE is initialising the backward probability matrix to 1/n, where n = number of haplotypes in the reference panel.
See randomHapSample in LSHapBaum.java:

```java
public HapAlleleProbs randomHapSample(int hap) {
    Arrays.fill(alleleProbs, 0f);
    int nMarkers = impData.nClusters();
    windowIndex = 0;
    arrayIndex = -1;
    setForwardValues(0, nMarkers, hap);
    Arrays.fill(bwdVal, 1.0f/n);
    setStateProbs(nMarkers-1, currentIndex());
    for (int m=nMarkers-2; m>=0; --m) {
        setBwdValue(m, hap);
        setStateProbs(m, previousIndex(hap));
    }
    setAlleleProbs(alleleProbs);
    return new LowMemHapAlleleProbs(refMarkers, impData.targetSamples(),
            hap, alleleProbs);
}
```
See this line: `Arrays.fill(bwdVal, 1.0f/n);`.
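In Li-and-Stephens terms, that uniform fill is just the usual backward initialisation at the last marker. Here's a rough sketch of the recursion with that initialisation (parameter names are illustrative; this is not BEAGLE's actual implementation):

```python
import numpy as np

def ls_backward(ref, query, rho, mu):
    """Sketch of an LS-model backward pass with uniform initialisation
    at the last marker, cf. Arrays.fill(bwdVal, 1.0f/n).

    ref: (m_sites, n_haps) reference panel; query: (m_sites,) haplotype.
    rho: per-site switch probabilities; mu: per-site mismatch probabilities.
    """
    m_sites, n = ref.shape
    bwd = np.zeros((m_sites, n))
    bwd[-1] = 1.0 / n  # uniform initialisation, as in BEAGLE
    for m in range(m_sites - 2, -1, -1):
        # Emission at the next marker, then transition back to marker m:
        # stay on the same haplotype with prob 1 - rho, or switch to a
        # uniformly chosen haplotype with prob rho.
        em = np.where(ref[m + 1] == query[m + 1], 1 - mu[m + 1], mu[m + 1])
        v = em * bwd[m + 1]
        bwd[m] = (1 - rho[m + 1]) * v + rho[m + 1] * v.sum() / n
    return bwd
```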
OK, I guess we'll need to figure this out. But we should be able to check easily enough that the final imputed values (info scores?) are equal?
There's also the matrix that combines the forward and backward values. I think BEAGLE gets the allele probabilities out of that.
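Presumably that's the standard forward-backward combination; a minimal sketch (the function names are mine, not BEAGLE's):

```python
import numpy as np

def state_probabilities(fwd, bwd):
    """Combine forward and backward values into per-site HMM state
    probabilities, normalising each site to sum to 1."""
    probs = fwd * bwd
    return probs / probs.sum(axis=1, keepdims=True)

def allele_probabilities(state_probs, ref):
    """Per-site allele probabilities for a biallelic site: sum state
    probabilities over the reference haplotypes carrying each allele.
    ref is (m_sites, n_haps), coded 0/1."""
    p1 = (state_probs * ref).sum(axis=1)
    return np.stack([1 - p1, p1], axis=1)
```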
I just discussed this with Duncan. He spotted that the uniform backward values have to do with the parametrisation.
I think we should just focus on the interpolated allele probabilities at all the imputed markers on query haplotypes. We can compare these allele probabilities from BEAGLE and our implementation here.
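For reference, here is a hypothetical sketch of that interpolation step: state probabilities are linearly interpolated between the two flanking genotyped markers and then summed per allele. The weighting scheme (physical position) and the names are assumptions, not BEAGLE's exact method.

```python
import numpy as np

def interpolate_allele_probs(state_probs, genotyped_pos, imputed_pos, ref_imputed):
    """Interpolate allele probabilities at ungenotyped markers.

    state_probs: (m_genotyped, n_haps) HMM state probabilities.
    genotyped_pos, imputed_pos: sorted marker positions.
    ref_imputed: (m_imputed, n_haps) reference alleles, coded 0/1.
    """
    # Index of the flanking genotyped markers for each imputed marker.
    right = np.clip(np.searchsorted(genotyped_pos, imputed_pos), 1, len(genotyped_pos) - 1)
    left = right - 1
    w = (imputed_pos - genotyped_pos[left]) / (genotyped_pos[right] - genotyped_pos[left])
    probs = (1 - w[:, None]) * state_probs[left] + w[:, None] * state_probs[right]
    p1 = (probs * ref_imputed).sum(axis=1)
    return np.stack([1 - p1, p1], axis=1)
```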
Just to summarise the outputs from BEAGLE we can probably use for comparison here:
- Forward probability matrix at genotyped markers.
- Backward probability matrix at genotyped markers.
- HMM state probability matrix at genotyped markers.
- Interpolated allele probabilities at imputed markers.
It is hard to replicate exactly what BEAGLE is doing, so at best we will be able to look at correlations between the values (as discussed above).
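Something as simple as this would do for the correlation check (names illustrative):

```python
import numpy as np

def matrix_correlation(a, b):
    """Pearson correlation between two flattened probability matrices."""
    return np.corrcoef(a.ravel(), b.ravel())[0, 1]

# e.g. assert matrix_correlation(tskit_probs, beagle_probs) > 0.99
```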
Can you please ping me when you want an update on this, @szhan? I'm going to mute this for now (I get notified on each update).
I've been looking at the interpolated allele probabilities computed by BEAGLE 4.1 (see the function setAlleleProbs in LSHapBaum.java). When the probability of one allele is near 1, it skips computing the probability of the other allele, which stays at 0 from the array initialisation. This should be kept in mind when comparing the BEAGLE allele probabilities with the allele probabilities we are computing.
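One way to handle this before comparing (the threshold here is an assumption, not a BEAGLE constant):

```python
import numpy as np

def fill_skipped_alleles(beagle_probs, tol=1e-4):
    """Where BEAGLE left one allele's probability at its initialised 0
    because the other is ~1, fill in the complement before comparison.

    beagle_probs: (m_sites, 2) biallelic allele probabilities.
    """
    probs = beagle_probs.copy()
    hit = probs.max(axis=1) >= 1 - tol  # rows where one allele is ~1
    comp = 1 - probs[hit].max(axis=1)   # complement of the dominant allele
    probs[hit] = np.where(probs[hit] == 0, comp[:, None], probs[hit])
    return probs
```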
I still need to figure out how to get test_compare_allele_probabilities to pass for test data 4 to 8.
I did a test run of run_tsimpute on a subset of the FinnGen data, which has reference haplotypes from ~8.5k individuals and query haplotypes from ~2.4k individuals (70,640 markers in total, of which 2,722 are genotyped). cProfile.Profile shows the following:
```
ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     1  141.395  141.395  141.395  141.395 {method 'backward_matrix' of '_tskit.LsHmm' objects}
     1   68.479   68.479   68.479   68.479 {method 'forward_matrix' of '_tskit.LsHmm' objects}
     1   20.971   20.971   20.972   20.972 beagle_numba.py:220(interpolate_allele_probabilities)
     2    4.429    2.215    4.429    2.215 {method 'decode' of '_tskit.CompressedMatrix' objects}
     1    0.814    0.814    0.814    0.814 beagle_numba.py:192(compute_state_probability_matrix)
     1    0.000    0.000    0.005    0.005 beagle_numba.py:261(get_map_alleles)
```
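For context, a run like that can be profiled along these lines (run_tsimpute's arguments here are placeholders):

```python
import cProfile
import pstats

# Profile one imputation run and print the ten most expensive calls.
with cProfile.Profile() as pr:
    run_tsimpute(ref_ts, query_haplotypes)  # placeholder arguments
pstats.Stats(pr).sort_stats("tottime").print_stats(10)
```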
I spent a lot of time going through BEAGLE 4.1's logic for recoding alleles, most of which is there to reduce computation time and memory usage (e.g. imputing IBS segments instead of individual sites, and storing only non-trivial state probabilities in the hidden state probability matrix). This logic makes it hard to align the forward, backward, and state probability matrices with those we get from _tskit.lshmm. I moved on to the BEAGLE 5.4 source code, which is heavily refactored and improved. I think I have finally managed to align the probability matrices.