tskit Handle missing data in two-locus statistics

Handle missing data in two-locus statistics

Open lkirk opened this issue 2 years ago • 1 comments

As mentioned here by @petrelharp during the review of #2805, we'd like a better treatment of missing data. As implemented, we compute $w_{AB}$, $w_{Ab}$, and $w_{aB}$, but use the total number of samples in the tree sequence as $n$. If there's missing data, $n$ will not be correct. We should instead compute $n = w_{AB} + w_{Ab} + w_{aB} + w_{ab}$ so that we properly account for missing data. This means that $n$ will count only the samples in the sample set that have non-missing data at both the left locus and the right locus.
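To make the proposed definition of $n$ concrete, here is a minimal pure-Python sketch, not the tskit implementation. It assumes biallelic sites encoded as 0/1 with -1 marking missing data (the value of `tskit.MISSING_DATA`); the function names `haplotype_weights` and `d_stat` are hypothetical, and $D$ is used only as an example statistic.

```python
MISSING = -1  # same value as tskit.MISSING_DATA


def haplotype_weights(left, right):
    """Count two-locus haplotypes over samples observed at BOTH loci.

    `left` and `right` are per-sample allele lists (0 = ancestral,
    1 = derived, MISSING = no data). Biallelic sites are an assumption
    of this sketch.
    """
    w_AB = w_Ab = w_aB = w_ab = 0
    for a, b in zip(left, right):
        if a == MISSING or b == MISSING:
            continue  # sample contributes to none of the weights
        if a == 1 and b == 1:
            w_AB += 1
        elif a == 1:
            w_Ab += 1
        elif b == 1:
            w_aB += 1
        else:
            w_ab += 1
    return w_AB, w_Ab, w_aB, w_ab


def d_stat(left, right):
    """D = p_AB - p_A * p_B, normalised by n = sum of the four weights."""
    w_AB, w_Ab, w_aB, w_ab = haplotype_weights(left, right)
    n = w_AB + w_Ab + w_aB + w_ab
    if n == 0:
        return float("nan")  # no sample is observed at both loci
    p_AB = w_AB / n
    p_A = (w_AB + w_Ab) / n
    p_B = (w_AB + w_aB) / n
    return p_AB - p_A * p_B


left = [1, 1, 0, 0, MISSING, 1]
right = [1, 0, 0, MISSING, 1, 1]
print(haplotype_weights(left, right))  # (2, 1, 0, 1), so n = 4 rather than 6
print(d_stat(left, right))             # 0.125
```

In this toy input two of the six samples are missing at one locus, so the weights sum to 4 and all frequencies are taken over those 4 samples.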

This will require a bit of restructuring: we will either need to intersect all samples with the samples of the current valid tree on the left and right, or seed the algorithm that propagates sample bit arrays across alleles.
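The first option (intersecting sample sets) could look roughly like the sketch below. This is an illustration only: it models bit arrays as Python integers, and every name in it (`to_bits`, `two_locus_counts`, the `*_present` masks) is hypothetical rather than part of tskit's internals.

```python
def popcount(bits):
    """Number of set bits in a non-negative integer bitmask."""
    return bin(bits).count("1")


def to_bits(sample_ids):
    """Build a bitmask with one bit set per sample ID."""
    bits = 0
    for s in sample_ids:
        bits |= 1 << s
    return bits


def two_locus_counts(sample_set, left_present, right_present, left_A, right_B):
    """Haplotype counts restricted to samples with data at both loci.

    All arguments are bitmasks over sample IDs:
      sample_set     samples in the sample set of interest
      left_present   samples with non-missing data at the left locus
      right_present  samples with non-missing data at the right locus
      left_A         samples carrying allele A at the left locus
      right_B        samples carrying allele B at the right locus
    """
    valid = sample_set & left_present & right_present
    A = left_A & valid
    B = right_B & valid
    w_AB = popcount(A & B)
    w_Ab = popcount(A & ~B)
    w_aB = popcount(B & ~A)
    n = popcount(valid)
    w_ab = n - w_AB - w_Ab - w_aB
    return w_AB, w_Ab, w_aB, w_ab, n


counts = two_locus_counts(
    sample_set=to_bits(range(6)),
    left_present=to_bits([0, 1, 2, 3, 5]),
    right_present=to_bits([0, 1, 2, 4, 5]),
    left_A=to_bits([0, 1, 5]),
    right_B=to_bits([0, 4, 5]),
)
print(counts)  # (2, 1, 0, 1, 4)
```

Masking the allele bit arrays with `valid` before counting keeps the four weights and $n$ consistent with each other, at the cost of one extra intersection per locus pair.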

Aug 30 '23 02:08 lkirk

We also don't really handle missing data in the standard tree stats API, so I'd be happy to kick this (tricky) can down the road.

Aug 30 '23 08:08 jeromekelleher

Closing for inactivity and labelling "future", please re-open if you plan to work on this.

Jun 12 '25 22:06 benjeffery
