
normalization for two-locus, multiallelic stats?

Open petrelharp opened this issue 2 years ago • 1 comments

Over at #2805, @lkirk has implemented various two-locus stats, e.g. r^2, D, etc. The strategy for computing these is to sum something over all pairs of alleles (one allele from each of the two loci). As currently implemented, there is a "normalization" function so that the value for a given pair of loci is

  \sum_i \sum_j F_{ij} W_{ij}

where F_{ij} is the summary function calculated for the pair of alleles, and W_{ij} is a weighting factor (the normalization).

@lkirk has proposed several weightings:

  1. uniform weighting (aka "total"); so W_{ij} = 1/(number of pairs of alleles)
  2. product of frequencies; so W_{ij} = p_i p_j, where p_i is the frequency of allele i
  3. haplotype frequency; so W_{ij} = p_{ij}, where p_{ij} is the frequency of the combination of alleles i and j.

To this let me add one more:

  4. unweighted; so W_{ij} = 1.
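To make the four options concrete, here is a minimal pure-Python sketch of \sum_i \sum_j F_{ij} W_{ij} under each weighting. It is not the tskit implementation: the function names, the scheme labels other than "total", and the toy haplotype-frequency table are all assumptions made for illustration, and F_{ij} is taken to be a per-allele-pair r^2.

```python
# Illustrative sketch only; none of these names come from tskit.
# hap[i][j] is the frequency of the haplotype carrying allele i at the
# first locus and allele j at the second locus (entries sum to 1).

def allele_freqs(hap):
    """Marginal allele frequencies at each locus."""
    p_a = [sum(row) for row in hap]          # first locus
    p_b = [sum(col) for col in zip(*hap)]    # second locus
    return p_a, p_b

def pair_r2(hap, i, j):
    """Summary function F_ij: r^2 for one pair of alleles.
    Assumes both alleles have frequency strictly between 0 and 1."""
    p_a, p_b = allele_freqs(hap)
    d = hap[i][j] - p_a[i] * p_b[j]
    return d * d / (p_a[i] * (1 - p_a[i]) * p_b[j] * (1 - p_b[j]))

def weight(hap, i, j, scheme):
    """Weighting factor W_ij for the four schemes discussed above."""
    p_a, p_b = allele_freqs(hap)
    if scheme == "total":            # (1) uniform
        return 1.0 / (len(p_a) * len(p_b))
    if scheme == "allele_product":   # (2) p_i p_j
        return p_a[i] * p_b[j]
    if scheme == "haplotype":        # (3) p_ij
        return hap[i][j]
    if scheme == "unweighted":       # (4) W_ij = 1
        return 1.0
    raise ValueError(f"unknown scheme: {scheme}")

def two_locus_stat(hap, summary, scheme):
    """sum_i sum_j F_ij W_ij over all pairs of alleles."""
    return sum(
        summary(hap, i, j) * weight(hap, i, j, scheme)
        for i in range(len(hap))
        for j in range(len(hap[0]))
    )

# Toy biallelic example (made-up frequencies).
hap = [[0.4, 0.1],
       [0.1, 0.4]]
schemes = ("total", "allele_product", "haplotype", "unweighted")
results = {s: two_locus_stat(hap, pair_r2, s) for s in schemes}
```

In this symmetric biallelic example every allele pair has the same r^2 (about 0.36), so the three schemes whose weights sum to one all return about 0.36, while "unweighted" returns four times that.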

I suspect we don't actually need the weights at all. For either (2) or (3), we can just incorporate the weight into the summary function (and this is how the one-locus stats work). Uniform weighting seems like it has undesirable properties: for instance, adding a single new allele as a result of genotyping error could make the resulting value change by quite a lot. However, @lkirk reports (see this notebook) that using uniform weighting gets the right answer for some statistics, while other ones do not. I haven't dug down into what's going on, so am filing the issue for us to think about later.
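Both points can be checked on a toy example. The sketch below is an illustration under made-up frequencies, with hypothetical helper names, and is not how tskit computes anything: it folds the haplotype weight into the summary function and sums it unweighted, then adds a rare third allele at one locus to see how much each weighting moves.

```python
# Illustrative sketch only; all names and numbers here are assumptions.

def pair_r2(hap, i, j):
    """Per-allele-pair r^2 from a haplotype-frequency table."""
    p_a = sum(hap[i])                    # frequency of allele i, first locus
    p_b = sum(row[j] for row in hap)     # frequency of allele j, second locus
    d = hap[i][j] - p_a * p_b
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))

def weighted_sum(hap, summary, weight):
    """sum_i sum_j F_ij W_ij."""
    return sum(
        summary(hap, i, j) * weight(hap, i, j)
        for i in range(len(hap))
        for j in range(len(hap[0]))
    )

def uniform(hap, i, j):
    return 1.0 / (len(hap) * len(hap[0]))

def haplotype(hap, i, j):
    return hap[i][j]

def unweighted(hap, i, j):
    return 1.0

def folded_r2(hap, i, j):
    """Summary function with the haplotype weight folded in."""
    return hap[i][j] * pair_r2(hap, i, j)

# Two alleles per locus.
hap2 = [[0.4, 0.1],
        [0.1, 0.4]]

# Same data, but 0.1% of haplotypes are miscalled as a new third allele
# at the first locus (a stand-in for genotyping error).
hap3 = [[0.399, 0.1],
        [0.1,   0.4],
        [0.001, 0.0]]

# Folding the weight into the summary function gives the same value.
explicit = weighted_sum(hap3, pair_r2, haplotype)
folded = weighted_sum(hap3, folded_r2, unweighted)

# Sensitivity of each weighting to the spurious allele.
uniform_shift = abs(weighted_sum(hap3, pair_r2, uniform)
                    - weighted_sum(hap2, pair_r2, uniform))
haplotype_shift = abs(weighted_sum(hap3, pair_r2, haplotype)
                      - weighted_sum(hap2, pair_r2, haplotype))
```

Here `explicit` and `folded` agree, and `uniform_shift` is far larger than `haplotype_shift`, because the two new allele pairs get the same weight as the four common ones under uniform weighting.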

I guess my first question will be: in the example where we needed uniform weighting to get the right answer, can we just change the summary function so that "unweighted" gets the right answer?

And: @lkirk, please correct me if I've got some of this wrong!

petrelharp, Aug 11 '23 18:08