tskit icon indicating copy to clipboard operation
tskit copied to clipboard

Feat : add reciprocal monophyly function to Tree class

Open Silloky opened this issue 1 year ago • 7 comments
trafficstars

Description

This PR adds a method to the Tree class in the Python API : is_recip_monophylectic(). It tests the current tree for reciprocal monophyly, a population genomics phenomenon as described here.

The key algorithm can be summarized like this : if each population's MRCA's children are the same as the all the population's leaves, all populations are reciprocally monophyletic.

PR Checklist:

  • [ ] Tests that fully cover new/changed functionality (tried, but I was faced with ModuleNotFoundError: No module named '_tskit' when importing tskit from my cloned fork)
  • [x] Documentation including tutorial content if appropriate. (code comments done, should I add to the official docs ?)

Also, this is only implemented in Python, not C, so someone needs to do that as I do not know C.

Silloky avatar Jul 24 '24 04:07 Silloky

Codecov Report

Attention: Patch coverage is 6.66667% with 14 lines in your changes missing coverage. Please review.

Project coverage is 90.45%. Comparing base (beafeba) to head (3bafdbf). Report is 150 commits behind head on main.

Files with missing lines Patch % Lines
python/tskit/trees.py 6.66% 14 Missing :warning:
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2974      +/-   ##
==========================================
+ Coverage   89.62%   90.45%   +0.82%     
==========================================
  Files          29       26       -3     
  Lines       30170    21619    -8551     
  Branches     5867     4372    -1495     
==========================================
- Hits        27041    19555    -7486     
+ Misses       1790     1160     -630     
+ Partials     1339      904     -435     
Flag Coverage Δ
c-tests 86.20% <ø> (ø)
lwt-tests ?
python-c-tests ?
python-tests 98.82% <6.66%> (-0.20%) :arrow_down:

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
python/tskit/trees.py 98.20% <6.66%> (-0.59%) :arrow_down:

... and 3 files with indirect coverage changes

:rocket: New features to boost your workflow:
  • :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

codecov[bot] avatar Jul 24 '24 04:07 codecov[bot]

ModuleNotFoundError: No module named '_tskit' have you run make in the python directory?

benjeffery avatar Jul 24 '24 07:07 benjeffery

See the docs here for details on building python extension module: https://tskit.dev/tskit/docs/stable/development.html#id1

jeromekelleher avatar Jul 24 '24 08:07 jeromekelleher

Here's a quick sketch of another approach to doing this @Silloky:

import msprime

demography = msprime.Demography()
demography.add_population(name="A", initial_size=1_000)
demography.add_population(name="B", initial_size=1_000)
demography.add_population(name="C", initial_size=10_000)
demography.add_population_split(time=10000, derived=["A", "B"], ancestral="C")
ts = msprime.sim_ancestry(samples={"A": 2, "B": 3}, demography=demography, random_seed=12)
print(ts)
print(ts.draw_text())

for pop_id in range(ts.num_populations):
    pop_samples = ts.samples(population=pop_id)
    if len(pop_samples) == 0:
        continue
    for tree in ts.trees(tracked_samples=pop_samples):
        mrca = tree.mrca(*pop_samples) # Inefficient
        monophyletic = len(pop_samples) == tree.num_tracked_samples(mrca)
        print(pop_id, mrca, monophyletic)                                                           

Here, we're computing a true/false value for whether a population (or more generally any sample subset) is monophyletic in each tree. We do this using the num_tracked_samples, which is efficiently computed for each node in the C layer. The only inefficient bit here is how we're computing the MRCA of all the samples (but in most cases would be fine).

I think a good API for this would be a TreeSequence method that looked something like this:

def is_monophyletic(self, samples):
     """
     Returns a boolean array of length ts.num_trees which is True if the corresponding tree is monophyletic
     for the given set of samples. That is, if ``u`` is the MRCA of all the specified samples in a given tree, no 
     other samples are present in the subtree rooted at u.
     """
     # Code that looks like the above.

How does this look to you? If the API matches your needs, then we can go ahead with implementation and testing, and then think about how we can do the MRCA more efficiently (if we need to)?

jeromekelleher avatar Jul 24 '24 08:07 jeromekelleher

Hi, thank you very much for your answer. Indeed, this looks more promising and definitely more efficient. I don't really have time now to look into it sadly... but I will try to implement your suggestions in August. Have a nice day and a nice summer !

Silloky avatar Jul 28 '24 08:07 Silloky

Hi @Silloky! Are you still planning to work on this?

benjeffery avatar Sep 23 '24 10:09 benjeffery

Hi @benjeffery, Yes, I am ! I do realise it's been a while, but I've been quite busy... I plan to implement it in the next few weeks.

Silloky avatar Sep 29 '24 08:09 Silloky

Closing this due to inactivity - please feel free to reopen if you'd like to pick it back up @Silloky

jeromekelleher avatar Apr 01 '25 10:04 jeromekelleher

Sure, haven't had times with all my exams... I'll work on this next summer.

Silloky avatar Apr 01 '25 11:04 Silloky