tskit icon indicating copy to clipboard operation
tskit copied to clipboard

More flexible interpretation of mutation rate when getting emission probability when using `_tskit.lshmm`

Open szhan opened this issue 1 year ago • 4 comments

Currently, the emission probability for calculating the forward, backward, and Viterbi matrices is defined as follows:

p_e = mu;
if (is_match) {
    p_e = 1 - (num_alleles - 1) * mu;
}

This implies that an allele at a site mutates to another allowed allele (num_alleles - 1) uniformly at random at a user-specified mutation rate, mu. Changing the formula for p_e to p_e = 1 - mu upon allele match should allow for more flexible (general?) interpretation of the mutation rate.

Relevant lines are the following: https://github.com/tskit-dev/tskit/blob/2dae133fd8c3414d2f9b5073b964e37b4f96e032/c/tskit/haplotype_matching.c#L1055 https://github.com/tskit-dev/tskit/blob/2dae133fd8c3414d2f9b5073b964e37b4f96e032/c/tskit/haplotype_matching.c#L1106 https://github.com/tskit-dev/tskit/blob/2dae133fd8c3414d2f9b5073b964e37b4f96e032/c/tskit/haplotype_matching.c#L1295

szhan avatar Jan 17 '24 14:01 szhan

Shouldn't the probability of match and probability of mismatch sum to 1? If that is the case, then I suspect there is a bug.

In this code snippet:

p_e = mu;
if (is_match) {
    p_e = 1 - (num_alleles - 1) * mu;
}

Probability of mismatch = mu Probability of match = 1 - (num_alleles - 1) * mu

Their sum is not 1 when num_alleles > 2.

Probability of match should be (1 - mu), unless probability of mismatch is (num_alleles - 1) * mu.

EDIT: Please disregard the above. I didn't realise it's the Rosen & Paten (2019) formulation, which rescales the mutation rate based on the number of alleles at the site. So, not a bug.

szhan avatar Jan 23 '24 19:01 szhan

I agree it is a weird parameterisation, and doesn't belong in the core HMM. We should make this type of rescaling optional at the Python level.

jeromekelleher avatar Jan 24 '24 10:01 jeromekelleher

I'm just looking at how computation of emission probability is done in test_haplotype_matching.py. It seems like it is going with a slightly different interpretation of mu (when there is no scaling based on the number of alleles)?

if is_match:
    p_e = 1 - mu
else:
    p_e = mu / (n_alleles - 1)

https://github.com/tskit-dev/tskit/blob/2dae133fd8c3414d2f9b5073b964e37b4f96e032/python/tests/test_haplotype_matching.py#L382

szhan avatar Mar 03 '24 14:03 szhan

Maybe mu has been rescaled somewhere else? There may well be inconsistencies here, though

jeromekelleher avatar Mar 04 '24 11:03 jeromekelleher