Deal with missing data in stats
The possibility of missing data is being added in #272; the stats need to be modified to take this into account. At least two issues:
- Site stats: isolated samples will wrongly be assumed to carry the ancestral allele.
- All stats: the denominators should count only the non-missing samples.
A nice thing about the current definitions is that, for many stats, the numerator won't need modifying to account for missing data.
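To make the denominator point concrete, here is a minimal pure-Python sketch (not tskit's implementation) of per-site pairwise diversity in which only non-missing samples are counted. The function name `site_diversity` and the flat genotype list are hypothetical, and `-1` is assumed as the missing-data code, which is the value of `tskit.MISSING_DATA`.

```python
import math
from collections import Counter

MISSING = -1  # assumed missing-data code; tskit.MISSING_DATA has this value


def site_diversity(genotypes):
    """Mean pairwise allele difference at one site, over non-missing samples."""
    observed = [g for g in genotypes if g != MISSING]
    n = len(observed)
    if n < 2:
        return math.nan  # no pair of observed samples to compare
    counts = Counter(observed)
    # Number of unordered pairs of observed samples carrying different alleles.
    differing = (n * n - sum(c * c for c in counts.values())) // 2
    # Denominator: pairs among the n non-missing samples only.
    return differing / (n * (n - 1) / 2)


# Samples 2 and 3 are missing at this site.
with_missing = [0, 1, MISSING, MISSING]
# The same site if missing samples are wrongly given the ancestral allele (0).
as_ancestral = [0, 1, 0, 0]
print(site_diversity(with_missing), site_diversity(as_ancestral))
```

In this toy case the one observed pair differs, so the missing-aware value is 1.0, whereas treating the missing samples as ancestral dilutes it to 0.5.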
Note that we might need to do something with isolated samples with a mutation above them (which are not meant to be missing). See https://github.com/tskit-dev/tskit/issues/2037#issuecomment-989304089
Should we currently error out if we detect missing data in a stats calculation, until this issue is fixed?
We probably should. Putting this into tskit 0.4.1.
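As a rough illustration of what such a guard could look like, here is a sketch that operates on plain genotype rows rather than on tskit's internals. The helper name `require_no_missing` and the error message are hypothetical, and `-1` is again assumed as the missing-data code (the value of `tskit.MISSING_DATA`).

```python
MISSING_DATA = -1  # assumed to match tskit.MISSING_DATA


def require_no_missing(genotype_rows):
    """Raise ValueError if any site has a sample with missing data.

    genotype_rows is an iterable of per-site genotype sequences
    (one entry per sample).
    """
    for site_index, row in enumerate(genotype_rows):
        if MISSING_DATA in row:
            raise ValueError(
                f"site {site_index} has missing data; "
                "statistics are not yet defined for this case"
            )
```

A stats routine could call this before computing anything, so that users get an explicit error instead of a silently biased value.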
A demo example from https://github.com/hyanwong/ancestor-PCA/issues/1: you might hope to get the same pairwise information out of all 3 of these tree sequences (in the last one, each pair occurs in each tree)
Revisiting this: the branch-length distance between two samples that are isolated from each other in the topology should also (IMO) be NaN or infinity. Currently, however, two isolated samples have a distance of 0 between them, e.g.:
import tskit
# Delete the whole interval [0, 1), leaving all 3 samples isolated everywhere
empty_ts = tskit.Tree.generate_comb(3).tree_sequence.delete_intervals([[0, 1]])
assert empty_ts.divergence([[0], [1]], mode="branch") == 0
This presumably applies to a number of other branch-length stats too.
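For comparison, here is a minimal sketch of the proposed behaviour on a single tree, written in plain Python rather than against tskit. The function `branch_distance`, the parent/time dictionaries, and the node times are all hypothetical; infinity is returned when the two nodes share no ancestor, though NaN would be an equally simple choice.

```python
import math


def branch_distance(parent, time, u, v):
    """Branch-length distance between nodes u and v in one tree.

    parent maps node -> parent node (roots and isolated nodes are absent).
    Returns math.inf when u and v share no ancestor.
    """
    # Collect u and all of its ancestors.
    lineage = set()
    node = u
    while node is not None:
        lineage.add(node)
        node = parent.get(node)
    # Walk up from v until we hit u's lineage: that node is the MRCA.
    node = v
    while node is not None and node not in lineage:
        node = parent.get(node)
    if node is None:
        return math.inf  # isolated from each other: no finite path
    return 2 * time[node] - time[u] - time[v]


# A 3-leaf comb: node 3 joins leaves 1 and 2, node 4 joins leaf 0 and node 3.
comb_parent = {0: 4, 1: 3, 2: 3, 3: 4}
comb_time = {0: 0.0, 1: 0.0, 2: 0.0, 3: 1.0, 4: 2.0}
# With no edges at all, every sample is isolated.
empty_parent = {}
print(branch_distance(comb_parent, comb_time, 1, 2))
print(branch_distance(comb_parent, comb_time, 0, 1))
print(branch_distance(empty_parent, comb_time, 0, 1))
```

Averaging such per-tree values over the genome would then propagate infinity (or NaN) wherever the pair is isolated, which is a design decision the real stats would still need to settle.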