tskit icon indicating copy to clipboard operation
tskit copied to clipboard

add watterson's theta

Open mufernando opened this issue 3 years ago • 8 comments

this would be simple, as watterson's theta is

ts.segregating_sites()/sum([1/i for i in range(1, ts.num_samples])

mufernando avatar Jun 23 '21 18:06 mufernando

PR welcome!

jeromekelleher avatar Jun 24 '21 09:06 jeromekelleher

Should it be implemented as TreeSequence.wattersons_theta()?

def wattersons_theta(self):
    return(self.segregating_sites() / np.sum([1/i for i in np.arange(1, self.num_samples)]))

szhan avatar Jun 01 '22 12:06 szhan

Should there be additional arguments sample_sets=None and windows=None like TreeSequence.diversity()?

szhan avatar Jun 01 '22 12:06 szhan

Probably, I'm not sure without looking things up @szhan. Here's some other implementations to check against:

  • https://scikit-allel.readthedocs.io/en/stable/stats/diversity.html#allel.watterson_theta
  • http://molpopgen.github.io/pylibseq/docs/_build/html/pages/summstats.html#libsequence.thetaw

jeromekelleher avatar Jun 01 '22 13:06 jeromekelleher

There's a slight caveat here: the denominator only holds if all samples have the same birth time (even when all of the other assumptions are met).

molpopgen avatar Jun 01 '22 17:06 molpopgen

Upon some further reflection: this caveat is no worse than applying the Tajma's D stat to a sample set of heterogeneous ages. So feel free to ignore me!

molpopgen avatar Jun 01 '22 17:06 molpopgen

This should go in the same place as ts.segregating_sites, and have all the args that segregating_sites( ) does - it's just a python-only wrapper around segregating_sites. I think that's actually in trees.py as it's a method of a TreeSequence; stats.py is an old way of doing things.

Your proposal is good - the main thing is writing some tests - just see how Tajima's D or some other python-only statistic is tested.

petrelharp avatar Jun 02 '22 01:06 petrelharp

I think that the tests for ts.Wattersons_theta() should be similar to the tests implemented for ts.segregating_sites() or another statistic that has the site, node, and branch modes.

szhan avatar Jun 27 '22 12:06 szhan