
Add tests for genotype imputation

Open szhan opened this issue 2 years ago • 19 comments

Description

Add toy examples for testing imputation. Results obtained by running BEAGLE 4.1 are stored for comparison.

Fixes #2802

PR Checklist:

  • [x] Implement BEAGLE's interpolation-style imputation algorithm in Python.
  • [x] Implement a simpler version of the BEAGLE algorithm using functionalities in _tskit (tsimpute).
  • [ ] Develop toy datasets for comparing BEAGLE and tskit-based imputation.
  • [x] Record BEAGLE parameters and results (genotype probabilities).
  • [ ] Implement functions allowing for convenient access to the toy datasets and BEAGLE results.
  • [ ] Do test runs to profile performance.
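As a rough sketch of the interpolation-style step in the first checklist item (the helper name `interpolate_state_probs` and the plain position-based weighting are illustrative assumptions, not BEAGLE's exact code, which weights by genetic-map distance), the idea is to linearly interpolate HMM state probabilities at ungenotyped markers from the two flanking genotyped markers:

```python
import numpy as np

def interpolate_state_probs(state_probs, typed_pos, untyped_pos):
    """Linearly interpolate state probabilities at untyped marker
    positions from the two flanking typed markers.

    state_probs: (num_typed_markers, num_ref_haplotypes) array.
    typed_pos, untyped_pos: sorted 1-D arrays of marker positions.
    """
    right = np.searchsorted(typed_pos, untyped_pos)
    right = np.clip(right, 1, len(typed_pos) - 1)
    left = right - 1
    # Weight of the left typed marker falls off with distance.
    w = (typed_pos[right] - untyped_pos) / (typed_pos[right] - typed_pos[left])
    return w[:, None] * state_probs[left] + (1 - w[:, None]) * state_probs[right]

typed_pos = np.array([10.0, 20.0, 30.0])
untyped_pos = np.array([15.0, 25.0])
sp = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])
print(interpolate_state_probs(sp, typed_pos, untyped_pos))
```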

szhan avatar Aug 07 '23 17:08 szhan

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 76.97%. Comparing base (8b1be4f) to head (24d7a49).

Additional details and impacted files

@@             Coverage Diff             @@
##             main    #2815       +/-   ##
===========================================
- Coverage   89.79%   76.97%   -12.82%     
===========================================
  Files          30       30               
  Lines       30399    30399               
  Branches     5909     5643      -266     
===========================================
- Hits        27296    23400     -3896     
- Misses       1778     5614     +3836     
- Partials     1325     1385       +60     
Flag Coverage Δ
c-tests 86.21% <ø> (ø)
lwt-tests 80.78% <ø> (ø)
python-c-tests 67.71% <ø> (ø)
python-tests ?

see 13 files with indirect coverage changes



codecov[bot] avatar Aug 07 '23 17:08 codecov[bot]

I think the tricky part is getting the normalization factors right.

szhan avatar Aug 08 '23 08:08 szhan

So we might not get exactly the same values, but I guess we would expect the output matrices to be proportional (since normalisation factors can be arbitrary).

jeromekelleher avatar Aug 08 '23 11:08 jeromekelleher

I've updated this to get the tskit code to run @szhan, and to remove the dependence on pandas for CSV parsing (which would have been a pain to update all the requirements.txt files). We can do the same thing (more or less) with numpy.
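As a hypothetical example of the numpy approach (made-up data; `np.genfromtxt` stands in for what `pandas.read_csv` was doing), the stored BEAGLE CSV results can be parsed with numpy alone:

```python
import io
import numpy as np

# Toy stand-in for one of the stored BEAGLE result files.
csv_text = "marker,prob_a,prob_b\nm1,0.9999,0.0001\nm2,0.1667,0.8333\n"

data = np.genfromtxt(
    io.StringIO(csv_text),
    delimiter=",",
    names=True,      # use the header row for field names
    dtype=None,      # infer per-column dtypes
    encoding="utf-8",
)
print(data["prob_a"])  # structured-array column access, like a DataFrame column
```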

jeromekelleher avatar Aug 09 '23 09:08 jeromekelleher

Here is the first stab at the comparison.

Rows = sites; columns = ref. haplotypes

Forward probability matrix tskit

[[0.23333333 0.03333333 0.23333333 0.23333333 0.03333333 0.23333333]
 [0.24637681 0.00724638 0.24637681 0.24637681 0.00724638 0.24637681]
 [0.2384058  0.02318841 0.2384058  0.2384058  0.02318841 0.2384058 ]
 [0.15942246 0.18115508 0.15942246 0.15942246 0.18115508 0.15942246]
 [0.05073599 0.39852802 0.05073599 0.05073599 0.39852802 0.05073599]]

beagle

[[9.9990e-01 1.0000e-04 9.9990e-01 9.9990e-01 1.0000e-04 9.9990e-01]
 [1.6665e-01 1.7000e-05 1.6665e-01 1.6665e-01 1.7000e-05 1.6665e-01]
 [1.7000e-05 1.6665e-01 1.7000e-05 1.7000e-05 1.6665e-01 1.7000e-05]
 [1.7000e-05 1.6665e-01 1.7000e-05 1.7000e-05 1.6665e-01 1.7000e-05]]

Backward probability matrix tskit

[[0.97527473 1.34615385 0.97527473 0.97527473 1.34615385 0.97527473]
 [0.88461538 8.84615385 0.88461538 0.88461538 8.84615385 0.88461538]
 [0.58974359 9.43589744 0.58974359 0.58974359 9.43589744 0.58974359]
 [0.38017094 2.09094017 0.38017094 0.38017094 2.09094017 0.38017094]
 [1.         1.         1.         1.         1.         1.        ]]

beagle

[[0.166667 0.166667 0.166667 0.166667 0.166667 0.166667]
 [0.166667 0.166667 0.166667 0.166667 0.166667 0.166667]
 [0.166667 0.166667 0.166667 0.166667 0.166667 0.166667]
 [0.166667 0.166667 0.166667 0.166667 0.166667 0.166667]]

The forward values look pretty proportional to me. But I don't get why the values in the BEAGLE backward matrix are so uniform.
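One way to make the proportionality check concrete (a sketch; `rows_proportional` is a hypothetical helper, not part of the test suite) is to verify that each row of one matrix is a scalar multiple of the matching row of the other, since each site can carry an arbitrary normalisation factor:

```python
import numpy as np

def rows_proportional(a, b, rtol=1e-3):
    """Check that each row of a is a scalar multiple of the matching
    row of b, i.e. equal up to a per-site normalisation factor."""
    ratios = a / b  # assumes b has no zero entries
    # Within a row, every ratio should match that row's first ratio.
    return np.allclose(ratios, ratios[:, :1], rtol=rtol)

a = np.array([[0.2333, 0.0333], [0.2464, 0.0072]])
b = a * np.array([[2.0], [5.0]])  # same matrix under per-site rescaling
print(rows_proportional(a, b))
```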

szhan avatar Aug 09 '23 12:08 szhan

I suspect it has to do with my compiled copy of BEAGLE 4.1. BEAGLE's forward-backward algorithm only keeps track of the values for one site/marker at a time in order to reduce memory requirements. For some reason, I'm not seeing them being updated across the iterations.

szhan avatar Aug 09 '23 14:08 szhan

Also, I'm seeing that BEAGLE initialises the backward probability matrix to 1/n, where n is the number of haplotypes in the reference panel.

See in LSHapBaum.java.

public HapAlleleProbs randomHapSample(int hap) {
    Arrays.fill(alleleProbs, 0f);
    int nMarkers = impData.nClusters();
    windowIndex = 0;
    arrayIndex = -1;
    setForwardValues(0, nMarkers, hap);
    Arrays.fill(bwdVal, 1.0f/n);
    setStateProbs(nMarkers-1, currentIndex());
    for (int m=nMarkers-2; m>=0; --m) {
        setBwdValue(m, hap);
        setStateProbs(m, previousIndex(hap));
    }
    setAlleleProbs(alleleProbs);
    return new LowMemHapAlleleProbs(refMarkers, impData.targetSamples(),
            hap, alleleProbs);
}

See this line Arrays.fill(bwdVal, 1.0f/n);.
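A toy Python analogue of that initialisation (a sketch under assumed emission/transition conventions with per-site renormalisation, not BEAGLE's exact recursion) would look like:

```python
import numpy as np

def backward(emission, trans, n):
    """Toy Li-and-Stephens-style backward pass mirroring BEAGLE's
    Arrays.fill(bwdVal, 1.0f/n): the last site starts at the uniform
    value 1/n rather than the usual 1.

    emission: (num_sites, n) per-haplotype emission probabilities.
    trans: per-site probability of switching to a random haplotype.
    """
    num_sites = emission.shape[0]
    bwd = np.empty_like(emission)
    bwd[-1] = 1.0 / n  # BEAGLE-style uniform initialisation
    for m in range(num_sites - 2, -1, -1):
        v = bwd[m + 1] * emission[m + 1]
        # Mix "stay" and "switch" according to the transition model.
        bwd[m] = (1 - trans) * v + trans * v.sum() / n
        bwd[m] /= bwd[m].sum()  # renormalise at each site
    return bwd

em = np.array([[0.9, 0.1, 0.1],
               [0.1, 0.9, 0.1],
               [0.1, 0.1, 0.9]])
print(backward(em, trans=0.1, n=3))
```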

szhan avatar Aug 09 '23 14:08 szhan

OK, I guess we'll need to figure this out. But we should be able to check that the final imputed values (info scores?) are equal, easily enough?

jeromekelleher avatar Aug 09 '23 15:08 jeromekelleher

There's also the matrix that combines the forward and backward values. I think BEAGLE gets the allele probabilities out of that.
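If it follows the standard forward-backward combination (an assumption about BEAGLE's internals; the toy values below are made up), the state probability matrix would be the normalised elementwise product of the forward and backward values:

```python
import numpy as np

# Toy forward and backward values for 2 sites x 2 reference haplotypes.
fwd = np.array([[0.23, 0.03], [0.25, 0.01]])
bwd = np.array([[0.97, 1.35], [1.00, 1.00]])

# State probability at each site: elementwise product, renormalised
# per site so each row sums to 1.
state = fwd * bwd
state /= state.sum(axis=1, keepdims=True)
print(state)
```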

szhan avatar Aug 09 '23 15:08 szhan

I just discussed with Duncan. He spotted that the uniform backward values have to do with the parametrization.

szhan avatar Aug 09 '23 20:08 szhan

I think we should just focus on the interpolated allele probabilities at all the imputed markers on query haplotypes. We can compare these allele probabilities from BEAGLE and our implementation here.

szhan avatar Aug 31 '23 09:08 szhan

Just to summarise, the outputs from BEAGLE we can probably use for comparison here:

  • Forward probability matrix at genotyped markers.
  • Backward probability matrix at genotyped markers.
  • HMM state probability matrix at genotyped markers.
  • Interpolated allele probabilities at imputed markers.

It is hard to replicate exactly what BEAGLE is doing, so at best we will be able to look at correlations between the values (as discussed above).
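A minimal sketch of such a correlation check (the `flat_correlation` helper and the toy matrices are illustrative, not the actual test data):

```python
import numpy as np

def flat_correlation(a, b):
    """Pearson correlation between two probability matrices, flattened.
    Useful when we only expect values to be correlated, not identical."""
    a, b = np.asarray(a).ravel(), np.asarray(b).ravel()
    return np.corrcoef(a, b)[0, 1]

# Toy forward values in the spirit of the matrices posted above.
tskit_fwd = np.array([[0.23, 0.03], [0.25, 0.01]])
beagle_fwd = np.array([[0.9999, 0.0001], [0.1667, 0.00002]])
print(flat_correlation(tskit_fwd, beagle_fwd))
```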

szhan avatar Aug 31 '23 09:08 szhan

Can you ping me please when you want some update on this @szhan? I'm going to mute for now (I get notified on each update)

jeromekelleher avatar Sep 01 '23 15:09 jeromekelleher

I've been looking at the interpolated allele probabilities computed by BEAGLE 4.1 (see the function setAlleleProbs in LSHapBaum.java). When the probability of one allele is near 1, it skips computing the probability of the other, which is left at 0 from array initialization. So this should be kept in mind when comparing the BEAGLE allele probabilities with the ones we are computing.
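A hedged sketch of a comparison that accounts for this shortcut (the helper name, tolerance, and toy values are assumptions, not the actual test code):

```python
import numpy as np

def compare_allele_probs(ours, beagle, atol=1e-3):
    """Compare allele probabilities while allowing for BEAGLE's
    shortcut: when one allele's probability is near 1, the other is
    left at its initialised value of 0 instead of being computed."""
    ours, beagle = np.asarray(ours), np.asarray(beagle)
    # Mask entries BEAGLE skipped: an exact 0 in a row whose maximum
    # probability is near-certain.
    skipped = (beagle == 0) & (beagle.max(axis=1, keepdims=True) > 1 - atol)
    return np.allclose(ours[~skipped], beagle[~skipped], atol=atol)

ours = np.array([[0.99995, 0.00005], [0.4, 0.6]])
beagle = np.array([[0.9999, 0.0], [0.4, 0.6]])  # (0, 1) was never computed
print(compare_allele_probs(ours, beagle))
```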

szhan avatar Sep 06 '23 11:09 szhan

I still need to figure out how to get test_compare_allele_probabilities to pass for test data 4 to 8.

szhan avatar Sep 06 '23 19:09 szhan

I did a test run of run_tsimpute on a subset of the FinnGen data, which has reference haplotypes from ~8.5k individuals and query haplotypes from ~2.4k individuals (70,640 markers in total, of which 2,722 are genotyped). cProfile.Profile shows the following:

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1  141.395  141.395  141.395  141.395 {method 'backward_matrix' of '_tskit.LsHmm' objects}
        1   68.479   68.479   68.479   68.479 {method 'forward_matrix' of '_tskit.LsHmm' objects}
        1   20.971   20.971   20.972   20.972 beagle_numba.py:220(interpolate_allele_probabilities)
        2    4.429    2.215    4.429    2.215 {method 'decode' of '_tskit.CompressedMatrix' objects}
        1    0.814    0.814    0.814    0.814 beagle_numba.py:192(compute_state_probability_matrix)
        1    0.000    0.000    0.005    0.005 beagle_numba.py:261(get_map_alleles)
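For reference, a profile like the one above can be collected with cProfile directly (the workload below is a stand-in for the real run_tsimpute call, which isn't reproduced here):

```python
import cProfile
import pstats

def run_tsimpute_demo():
    # Stand-in workload; any imputation run would be profiled the same way.
    return sum(i * i for i in range(100_000))

profiler = cProfile.Profile()
profiler.enable()
run_tsimpute_demo()
profiler.disable()

# Sort by total time, as in the table above, and show the top entries.
pstats.Stats(profiler).sort_stats("tottime").print_stats(6)
```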

szhan avatar Sep 18 '23 17:09 szhan

I spent a lot of time going through BEAGLE 4.1's logic for recoding alleles, most of which exists to reduce computation time and memory usage (e.g. imputing IBS segments instead of individual sites and storing only non-trivial state probabilities in the hidden state probability matrix). That logic makes it hard to align the forward, backward, and state probability matrices with those we get from _tskit.lshmm. I moved on to the BEAGLE 5.4 source code, which is heavily refactored and improved, and I have finally managed to align the probability matrices, I think.

szhan avatar Jan 12 '24 16:01 szhan