
Deal with missing data in stats

Open petrelharp opened this issue 6 years ago • 7 comments

The possibility of missing data is being added in #272; the stats need to be modified to take this into account. At least two issues:

  1. Site stats: isolated samples will wrongly be assumed to carry the ancestral allele.
  2. All stats: the denominators should count only the non-missing samples.

A nice thing about the current definitions is that, for many stats, the numerator won't need modifying to account for missing data.
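
To make the two points above concrete, here is a minimal pure-Python sketch (not tskit's implementation) of mean pairwise difference at a single site, where the denominator counts only pairs of non-missing genotypes. The function name `site_diversity` is hypothetical, and using -1 as the missing marker is an assumption chosen to match the value of `tskit.MISSING_DATA`.

```python
from itertools import combinations
import math

MISSING = -1  # assumed missing-data marker, same value as tskit.MISSING_DATA


def site_diversity(genotypes):
    """Mean pairwise difference at one site, skipping missing genotypes.

    Returns NaN if fewer than two samples have non-missing genotypes.
    """
    observed = [g for g in genotypes if g != MISSING]
    n_pairs = len(observed) * (len(observed) - 1) // 2
    if n_pairs == 0:
        return math.nan
    differing = sum(a != b for a, b in combinations(observed, 2))
    return differing / n_pairs


# The last sample is really missing; coding it as ancestral (0) changes the answer.
naive = site_diversity([0, 0, 1, 0])            # 3 differing pairs out of 6
corrected = site_diversity([0, 0, 1, MISSING])  # 2 differing pairs out of 3
```

In this toy case the naive value is 0.5 and the corrected value is 2/3, so both the allele assumption and the denominator matter.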

petrelharp avatar Aug 06 '19 17:08 petrelharp

Note that we might need to do something with isolated samples with a mutation above them (which are not meant to be missing). See https://github.com/tskit-dev/tskit/issues/2037#issuecomment-989304089

hyanwong avatar Dec 08 '21 23:12 hyanwong

Should we currently error out if we detect missing data in a stats calculation, until this issue is fixed?

hyanwong avatar Dec 08 '21 23:12 hyanwong

We probably should. Putting this into tskit 0.4.1.

jeromekelleher avatar Dec 09 '21 12:12 jeromekelleher

A demo example from https://github.com/hyanwong/ancestor-PCA/issues/1: you might hope to get the same pairwise information out of all three of these tree sequences (in the last, each pair occurs in each tree).

[Image: the three tree sequences]

hyanwong avatar Jul 13 '23 14:07 hyanwong

Revisiting this: the branch-length distance between two samples that are isolated from each other in the topology should (IMO) also be NaN or infinity. Currently, however, two isolated samples have a distance of 0 between them, e.g.:

import tskit

# Deleting the whole [0, 1) interval removes every edge, leaving all samples isolated
empty_ts = tskit.Tree.generate_comb(3).tree_sequence.delete_intervals([[0, 1]])
assert empty_ts.divergence([[0],[1]], mode="branch") == 0

This presumably applies to a number of other branch-length stats too.
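
As an illustration of the behaviour proposed above (a sketch only, not how tskit computes branch stats), the following pure-Python function returns the branch-length distance between two nodes in a single tree, and NaN when they share no ancestor. The function name `branch_distance`, the plain `parent` and `time` lists, and the example node numbering are all hypothetical; the -1 "no parent" marker is an assumption chosen to match the value of `tskit.NULL`.

```python
import math

NULL = -1  # assumed "no parent" marker, same value as tskit.NULL


def branch_distance(u, v, parent, time):
    """Total branch length separating nodes u and v in one tree.

    parent[i] is the parent of node i (NULL for roots and isolated nodes)
    and time[i] is its age. Returns NaN when u and v share no ancestor.
    """
    # Collect u and all of its ancestors.
    ancestors = set()
    node = u
    while node != NULL:
        ancestors.add(node)
        node = parent[node]
    # Walk up from v until we reach one of them: that node is the MRCA.
    node = v
    while node != NULL and node not in ancestors:
        node = parent[node]
    if node == NULL:
        return math.nan  # disconnected: no finite distance is defined
    return (time[node] - time[u]) + (time[node] - time[v])


# A hypothetical 3-leaf comb: leaves 0, 1, 2; node 3 joins 1 and 2; node 4 is the root.
comb_parent = [4, 3, 3, 4, NULL]
comb_time = [0, 0, 0, 1, 2]
d_connected = branch_distance(0, 1, comb_parent, comb_time)

# The same three samples with every edge removed, so each one is isolated.
d_isolated = branch_distance(0, 1, [NULL, NULL, NULL], [0, 0, 0])
```

Here `d_connected` is 4, while `d_isolated` is NaN rather than 0. Returning infinity instead would only require changing the value returned in the disconnected case.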

hyanwong avatar Jul 14 '23 12:07 hyanwong