normalization for two-locus, multiallelic stats?
Over at #2805, @lkirk has implemented various two-locus stats, e.g. r^2 and D. The strategy for computing these is to sum something over all pairs of alleles (one allele from each of the two loci). As currently implemented, there is a "normalization" function so that the value for a given pair of loci is
\sum_i \sum_j F_{ij} W_{ij}
where F_{ij} is the summary function calculated for the pair of alleles, and W_{ij} is a weighting factor (the normalization).
@lkirk has proposed several weightings:
1. uniform weighting (aka "total"); so W_{ij} = 1/(number of pairs of alleles)
2. product of frequencies; so W_{ij} = p_i p_j, where p_i is the frequency of allele i
3. haplotype frequency; so W_{ij} = p_{ij}, where p_{ij} is the frequency of the combination of alleles i and j.
To this let me add one more:
4. unweighted; so W_{ij} = 1.
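To make the four options concrete, here is a minimal NumPy sketch of the weighted sum for one pair of loci. This is not the code from #2805: the function names (`two_locus_stat`, `r2`), the weighting labels, and the input (a matrix of haplotype counts, alleles at the first locus in rows and at the second in columns) are all assumptions made for illustration, and I'm assuming "number of pairs of alleles" means the number of entries in that matrix.

```python
import numpy as np


def r2(p_ij, p_i, p_j):
    # Per-allele-pair r^2: D_ij^2 / (p_i (1 - p_i) p_j (1 - p_j)).
    # Assumes every allele is segregating, so the denominator is nonzero.
    D = p_ij - np.outer(p_i, p_j)
    return D**2 / np.outer(p_i * (1 - p_i), p_j * (1 - p_j))


def two_locus_stat(hap_counts, summary, weighting):
    # hap_counts[i, j]: number of samples carrying allele i at the first
    # locus and allele j at the second (hypothetical input format).
    counts = np.asarray(hap_counts, dtype=float)
    p_ij = counts / counts.sum()
    p_i = p_ij.sum(axis=1)  # allele frequencies at the first locus
    p_j = p_ij.sum(axis=0)  # allele frequencies at the second locus
    F = summary(p_ij, p_i, p_j)

    if weighting == "uniform":  # (1) W_ij = 1 / (number of allele pairs)
        W = np.full_like(p_ij, 1.0 / p_ij.size)
    elif weighting == "freq_product":  # (2) W_ij = p_i p_j
        W = np.outer(p_i, p_j)
    elif weighting == "haplotype":  # (3) W_ij = p_ij
        W = p_ij
    elif weighting == "unweighted":  # (4) W_ij = 1
        W = np.ones_like(p_ij)
    else:
        raise ValueError(f"unknown weighting: {weighting}")

    # \sum_i \sum_j F_ij W_ij
    return float(np.sum(F * W))


counts = np.array([[3, 1], [1, 3]])
for w in ("uniform", "freq_product", "haplotype", "unweighted"):
    print(w, two_locus_stat(counts, r2, w))
```

In this biallelic example the per-pair r^2 is the same for all four allele pairs, so any weighting whose weights sum to one gives that common value, and "unweighted" gives four times it.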
I suspect we don't actually need the weights at all. For either (2) or (3), we can just incorporate the weight into the summary function (and this is how the one-locus stats work). Uniform weighting seems like it has undesirable properties: for instance, adding a single new allele as a result of genotyping error could make the resulting value change by quite a lot. However, @lkirk reports (see this notebook) that using uniform weighting gets the right answer for some statistics, while other ones do not. I haven't dug down into what's going on, so am filing the issue for us to think about later.
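Both points can be illustrated with a small, self-contained sketch. The data are made up (two perfectly linked biallelic loci, then the same data with one sample miscalled as a third allele at the first locus), and `r2_terms` is a hypothetical helper, not anything from #2805.

```python
import numpy as np


def r2_terms(hap_counts):
    # Per-allele-pair r^2 plus the marginal allele frequencies.
    # Assumes every allele is segregating, so the denominator is nonzero.
    counts = np.asarray(hap_counts, dtype=float)
    p_ij = counts / counts.sum()
    p_i = p_ij.sum(axis=1)
    p_j = p_ij.sum(axis=0)
    D = p_ij - np.outer(p_i, p_j)
    F = D**2 / np.outer(p_i * (1 - p_i), p_j * (1 - p_j))
    return F, p_i, p_j


# Two perfectly linked biallelic loci, 100 samples.
clean = [[50, 0], [0, 50]]
# Same data, but one sample is miscalled as a third allele at the first locus.
noisy = [[49, 0], [0, 50], [1, 0]]

F_clean, _, _ = r2_terms(clean)
F_noisy, p_i, p_j = r2_terms(noisy)

# Uniform weighting: average over allele pairs, however rare they are.
uniform_clean = F_clean.mean()
uniform_noisy = F_noisy.mean()

# Frequency-product weighting, written as an explicit weight ...
weighted_noisy = np.sum(F_noisy * np.outer(p_i, p_j))
# ... and the same thing with the weight folded into the summary
# function, then summed with no weights at all.
folded_noisy = np.sum(np.outer(p_i, p_j) * F_noisy)

print("uniform:", uniform_clean, "->", uniform_noisy)
print("freq-product (explicit weight):", weighted_noisy)
print("freq-product (folded, unweighted sum):", folded_noisy)
```

With these made-up counts, the single miscalled sample adds a row of near-zero terms that uniform weighting counts as heavily as the common alleles, so the uniform value falls a long way, while the frequency-weighted value barely moves. The last two numbers agree because folding p_i p_j into F_{ij} and summing unweighted is the same sum.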
I guess my first question will be: in the example where we needed uniform weighting to get the right answer, can we just change the summary function so that "unweighted" gets the right answer?
And: @lkirk, please correct me if I've got some of this wrong!