tskit
Handle missing data in two-locus statistics
As mentioned here by @petrelharp during the review of #2805, we'd like a better treatment of missing data. As implemented, we compute $w_{AB}$, $w_{Ab}$, and $w_{aB}$, but use the total number of samples in the tree sequence as $n$. If there's missing data, this $n$ will not be correct. We should instead compute $n = w_{AB} + w_{Ab} + w_{aB} + w_{ab}$ so that missing data is properly accounted for. With this definition, $n$ counts only the samples in the sample set that have data at both the left locus and the right locus.
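To make the proposed definition of $n$ concrete, here is a minimal sketch on plain genotype arrays. It is not the tskit implementation: the function name `two_locus_counts` is hypothetical, the sites are assumed to be biallelic (0 = ancestral, 1 = derived), and missing genotypes are assumed to be coded as -1.

```python
import numpy as np

MISSING = -1  # assumed missing-data code for this sketch


def two_locus_counts(left, right):
    """Return (w_AB, w_Ab, w_aB, w_ab, n) for two biallelic sites.

    Genotypes are 0 (ancestral: a or b), 1 (derived: A or B) or MISSING.
    """
    left = np.asarray(left)
    right = np.asarray(right)
    # Only samples with data at both loci contribute to a haplotype count.
    observed = (left != MISSING) & (right != MISSING)
    l, r = left[observed], right[observed]
    w_AB = int(np.sum((l == 1) & (r == 1)))
    w_Ab = int(np.sum((l == 1) & (r == 0)))
    w_aB = int(np.sum((l == 0) & (r == 1)))
    w_ab = int(np.sum((l == 0) & (r == 0)))
    # n is the sum of the four haplotype counts, not the total sample count.
    n = w_AB + w_Ab + w_aB + w_ab
    return w_AB, w_Ab, w_aB, w_ab, n


left = [1, 1, 0, 0, MISSING, 1]
right = [1, 0, 1, MISSING, 0, 1]
print(two_locus_counts(left, right))  # (2, 1, 1, 0, 4)
```

There are six samples here, but two of them are missing at one locus, so $n = 4$ rather than 6.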
This will require a bit of restructuring: we will either need to intersect all samples with the samples of the current valid tree on the left and on the right, or seed the algorithm that propagates sample bit arrays across alleles.
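As a rough illustration of the intersection option, the sketch below uses Python integers as sample bit arrays (bit i set means sample i is in the set). The function name and the `left_valid` / `right_valid` masks are hypothetical stand-ins for "samples with data in the current tree at that locus"; this is an assumption about how the restructuring might look, not the existing C code.

```python
def popcount(x):
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")


def masked_haplotype_counts(A_bits, B_bits, left_valid, right_valid):
    """Count haplotypes over samples that have data at both loci.

    A_bits / B_bits: samples carrying the derived allele at the left / right site.
    left_valid / right_valid: samples with non-missing data at each site.
    """
    # Samples usable for a two-locus statistic must be valid at both loci.
    valid = left_valid & right_valid
    w_AB = popcount(A_bits & B_bits & valid)
    w_Ab = popcount(A_bits & ~B_bits & valid)
    w_aB = popcount(~A_bits & B_bits & valid)
    w_ab = popcount(~A_bits & ~B_bits & valid)
    n = w_AB + w_Ab + w_aB + w_ab  # equals popcount(valid)
    return w_AB, w_Ab, w_aB, w_ab, n


# Six samples: sample 4 is missing on the left, sample 3 on the right.
counts = masked_haplotype_counts(
    A_bits=0b100011,       # samples 0, 1, 5 carry A
    B_bits=0b100101,       # samples 0, 2, 5 carry B
    left_valid=0b101111,   # all samples except 4
    right_valid=0b110111,  # all samples except 3
)
print(counts)  # (2, 1, 1, 0, 4)
```

Masking every count with the same `valid` set guarantees that the four weights sum to the number of samples observed at both loci.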
We also don't really handle missing data in the standard tree stats API, so I'd be happy to kick this (tricky) can down the road.
Closing for inactivity and labelling "future", please re-open if you plan to work on this.