tskit
tskit copied to clipboard
extend path algorithm (extend_edgesV2)
New extension (pun intended) to the extend_edges algorithm now called extend_paths (subject to change). We noticed that certain examples within unary regions would not be extended simply with extend edges. We propose this algorithm to handle examples like the following:
For edge path $a \to b \to c$ in tree $T_1$ and edge $a \to d \to c$ in $T_2$, assuming $d\notin T_1$ and $b\notin T_2$ and $t(b)>t(d)$ we should have extended path $a\to b \to d \to c$ in tree $T_2$.
Unit tests still need to be built, and the algorithm needs to be cleaned.
We hypothesize that using this alongside extend edges could be the optimal strategy for minimizing edges in tree sequences.
Codecov Report
Attention: Patch coverage is 89.85507% with 35 lines in your changes missing coverage. Please review.
Project coverage is 89.72%. Comparing base (
4170933) to head (659218e). Report is 2 commits behind head on main.
| Files with missing lines | Patch % | Lines |
|---|---|---|
| c/tskit/trees.c | 89.73% | 20 Missing and 15 partials :warning: |
Additional details and impacted files
@@ Coverage Diff @@
## main #2938 +/- ##
==========================================
+ Coverage 89.71% 89.72% +0.01%
==========================================
Files 29 29
Lines 31355 31567 +212
Branches 6077 6113 +36
==========================================
+ Hits 28130 28324 +194
- Misses 1840 1852 +12
- Partials 1385 1391 +6
| Flag | Coverage Δ | |
|---|---|---|
| c-tests | 86.55% <89.73%> (+0.06%) |
:arrow_up: |
| lwt-tests | 80.78% <ø> (ø) |
|
| python-c-tests | 89.01% <100.00%> (+0.01%) |
:arrow_up: |
| python-tests | 99.01% <100.00%> (ø) |
Flags with carried forward coverage won't be shown. Click here to find out more.
| Files with missing lines | Coverage Δ | |
|---|---|---|
| python/_tskitmodule.c | 89.01% <100.00%> (+0.01%) |
:arrow_up: |
| python/tskit/trees.py | 98.78% <100.00%> (ø) |
|
| c/tskit/trees.c | 90.45% <89.73%> (-0.02%) |
:arrow_down: |
Proposal for the basis for a test: here is a definition of what should be extendable by this method; so we can check if there remain no extendable edges (when run until no further changes are made).
A right extendable path from $x$ to $y$ in a tree $T_k$ is a path $x <_k a_1 <_k \cdots <_k a_n <_k y$ (where $<_k$ denotes "below in tree $T_k$" that has the properties:
- $x <{k+1} b_1 <{k+1} \cdots <{k+1} b_m <{k+1} y$ for some $b_i$ (the endpoints of the path are in the next tree)
- for $1 < j < n$, $a_j$ is not in $T_{k+1}$, and for $1 < j < m$, $b_j$ is not in $T_{k}$ (none of the intermediate nodes from the two paths are in the other tree).
The nodes $a_i$ are right extendable, and the nodes $b_i$ are left extendable.
Why is this the right definition? Well, first it's clear we can't extend nodes that are in the next tree. Next, suppose that there is a $b_j$ on the path from $x$ to $y$ in $T_{k+1}$ that is also in $T_k$, but not on the path from $x$ to $y$. If $b_j <_k x$ then we can take $b_j$ as the other endpoint of the extendable path instead of $y$. If $b_j \nless_k x$, then that implies ancestry above $b_j$ was inherited along a different path back to the time of $x$, so we can't use inheritance from $x$ as a criteria for extending. A similar argument holds if $y <_k b_j$ or and $y \nless_k b_j$.
I've added two things:
- a simple check based on the previous comment for whether a ts is no longer extendable; so if our algorithm terminates before max_iter, then that test should agree; and
- an implementation of extend paths that just rebuilds the whole tree sequence after every tree, so is a lot simpler. (I called it "naive", but it's not actually simple enough to be 'naive'; I had to deal with quite a few fiddly things for it.) It's real slow, so can't be used on anything at all big.
Next step: compare it to the non-naive version.
This is hopefully ready for you to look at, @nspope!
One question is that currently max_iter is required; but, should we provide a default? Should we do a warning if it hasn't terminated (probably)?
One question is that currently
max_iteris required; but, should we provide a default? Should we do a warning if it hasn't terminated (probably)? I think that makes sense. I haven't checked if we each reach the max_iter value for the long sequences we look at, but I feel like it terminates before that value in most cases.
Okay - now it's ready for reals, @nspope. The 'partial coverage' things that codecov flags are cases where we're like
if (A && B) { ... }
but then not all four combinations of A and B ever get seen. This let me identify one place that could be simplified, but I've looked at the others and I don't think any of the others can. They are getting hit by the python tests, and constructing C tests for all of them would be a Big Pain.
Okay - @jeromekelleher, I think this is ready for a quick look (and then we can merge it! :tada:)
Wait two more things:
- provide some kind of feedback for
max_iter - one remaining uncovered code bit
Okay - now I've legit hit everything we can with tests (and simplified some things along the way). @jeromekelleher - could you just throw a quick eyeball at this?
Thanks for the pointer pointers! I think this is ready to merge (assuming tests pass).
Excellent! I think you just need to cast to (int) for the printfs, give the commits a bit of a squash and it's ready to go in.
Argh, yes - I just missed one cast. Ready now, hopefully!