
Handle missing data in two-locus statistics

Open lkirk opened this issue 2 years ago • 1 comments

As mentioned here by @petrelharp during the review of #2805, we'd like a better treatment of missing data. As implemented, we compute $w_{AB}$, $w_{Ab}$, and $w_{aB}$, but use the total number of samples in the tree sequence as $n$. If there's missing data, this $n$ will not be correct. We should instead compute $n = w_{AB} + w_{Ab} + w_{aB} + w_{ab}$ so that missing data is properly accounted for. This means that $n$ will count only the samples in the sample set that have non-missing data at both the left locus and the right locus.
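To make the proposed definition of $n$ concrete, here is a minimal sketch in plain NumPy. It is not the tskit implementation: `two_locus_counts` is a hypothetical helper, the sites are assumed biallelic with `1` marking the $A$/$B$ allele, and missing genotypes are assumed to be coded as `-1`.

```python
import numpy as np

MISSING = -1  # assumption: missing genotypes are coded as -1


def two_locus_counts(left, right):
    """Hypothetical helper: haplotype counts for two biallelic sites.

    `left` and `right` hold one genotype per sample (1 = A or B allele,
    0 = a or b allele, MISSING = no data).
    """
    left = np.asarray(left)
    right = np.asarray(right)
    # Only samples observed at BOTH loci can contribute to a haplotype count.
    observed = (left != MISSING) & (right != MISSING)
    a = left[observed] == 1
    b = right[observed] == 1
    w_AB = int(np.sum(a & b))
    w_Ab = int(np.sum(a & ~b))
    w_aB = int(np.sum(~a & b))
    w_ab = int(np.sum(~a & ~b))
    # n is the sum of the four haplotype counts, not the total sample count.
    n = w_AB + w_Ab + w_aB + w_ab
    return w_AB, w_Ab, w_aB, w_ab, n


# Six samples, one missing at each locus (different samples).
left = [1, 1, 0, 0, MISSING, 1]
right = [1, 0, 1, MISSING, 0, 1]
counts = two_locus_counts(left, right)
```

In this toy input two of the six samples are dropped, so $n$ comes out smaller than the total number of samples, which is the discrepancy described above.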

This will require a bit of restructuring: we will either need to intersect all samples with the samples of the current valid tree on the left and right, or we'll want to seed the algorithm that propagates sample bit arrays across alleles.
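The first option (intersecting with the samples that are valid on each side) could look roughly like the sketch below. Everything here is illustrative: sample sets are modelled as Python integers used as bitsets (bit `i` set means sample `i` is a member), and `masked_counts`, `left_valid`, and `right_valid` are hypothetical names, not part of the existing code.

```python
def popcount(bits: int) -> int:
    """Number of set bits in a non-negative integer."""
    return bin(bits).count("1")


def masked_counts(A_bits, B_bits, left_valid, right_valid, sample_set):
    """Hypothetical sketch: haplotype counts restricted to valid samples.

    A_bits / B_bits: samples carrying allele A (left) / B (right).
    left_valid / right_valid: samples with non-missing data at each locus.
    sample_set: the samples the statistic is computed over.
    """
    # Samples in the sample set that are observed at both loci.
    valid = sample_set & left_valid & right_valid
    n = popcount(valid)
    # Masking with `valid` also clears the high bits introduced by `~`.
    w_AB = popcount(A_bits & B_bits & valid)
    w_Ab = popcount(A_bits & ~B_bits & valid)
    w_aB = popcount(~A_bits & B_bits & valid)
    return w_AB, w_Ab, w_aB, n


# Six samples; sample 4 is missing on the left, sample 3 on the right.
result = masked_counts(
    A_bits=0b100011,
    B_bits=0b100101,
    left_valid=0b101111,
    right_valid=0b110111,
    sample_set=0b111111,
)
```

With the mask applied up front, $w_{ab}$ can still be recovered as $n - w_{AB} - w_{Ab} - w_{aB}$, so only the definition of $n$ changes.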

lkirk avatar Aug 30 '23 02:08 lkirk

We also don't really handle missing data in the standard tree stats API, so I'd be happy to kick this (tricky) can down the road.

jeromekelleher avatar Aug 30 '23 08:08 jeromekelleher

Closing for inactivity and labelling "future". Please re-open if you plan to work on this.

benjeffery avatar Jun 12 '25 22:06 benjeffery