tskit extend path algorithm (extend

New extension (pun intended) to the extend_edges algorithm now called extend_paths (subject to change). We noticed that certain examples within unary regions would not be extended simply with extend edges. We propose this algorithm to handle examples like the following: For edge path $a \to b \to c$ in tree $T_1$ and edge $a \to d \to c$ in $T_2$, assuming $d\notin T_1$ and $b\notin T_2$ and $t(b)>t(d)$ we should have extended path $a\to b \to d \to c$ in tree $T_2$. Unit tests still need to be built, and the algorithm needs to be cleaned.

We hypothesize that using this alongside extend edges could be the optimal strategy for minimizing edges in tree sequences.

Apr 25 '24 22:04 hfr1tz3

Codecov Report

Attention: Patch coverage is 89.85507% with 35 lines in your changes missing coverage. Please review.

Project coverage is 89.72%. Comparing base (4170933) to head (659218e). Report is 2 commits behind head on main.

Files with missing lines	Patch %	Lines
c/tskit/trees.c	89.73%	20 Missing and 15 partials :warning:

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2938      +/-   ##
==========================================
+ Coverage   89.71%   89.72%   +0.01%     
==========================================
  Files          29       29              
  Lines       31355    31567     +212     
  Branches     6077     6113      +36     
==========================================
+ Hits        28130    28324     +194     
- Misses       1840     1852      +12     
- Partials     1385     1391       +6

Flag	Coverage Δ
c-tests	`86.55% <89.73%> (+0.06%)`	:arrow_up:
lwt-tests	`80.78% <ø> (ø)`
python-c-tests	`89.01% <100.00%> (+0.01%)`	:arrow_up:
python-tests	`99.01% <100.00%> (ø)`

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines	Coverage Δ
python/_tskitmodule.c	`89.01% <100.00%> (+0.01%)`	:arrow_up:
python/tskit/trees.py	`98.78% <100.00%> (ø)`
c/tskit/trees.c	`90.45% <89.73%> (-0.02%)`	:arrow_down:

... and 1 file with indirect coverage changes

Apr 25 '24 22:04 codecov[bot]

Proposal for the basis for a test: here is a definition of what should be extendable by this method; so we can check if there remain no extendable edges (when run until no further changes are made).

A right extendable path from $x$ to $y$ in a tree $T_k$ is a path $x <_k a_1 <_k \cdots <_k a_n <_k y$ (where $<_k$ denotes "below in tree $T_k$" that has the properties:

$x <{k+1} b_1 <{k+1} \cdots <{k+1} b_m <{k+1} y$ for some $b_i$ (the endpoints of the path are in the next tree)
for $1 < j < n$, $a_j$ is not in $T_{k+1}$, and for $1 < j < m$, $b_j$ is not in $T_{k}$ (none of the intermediate nodes from the two paths are in the other tree).

The nodes $a_i$ are right extendable, and the nodes $b_i$ are left extendable.

Why is this the right definition? Well, first it's clear we can't extend nodes that are in the next tree. Next, suppose that there is a $b_j$ on the path from $x$ to $y$ in $T_{k+1}$ that is also in $T_k$, but not on the path from $x$ to $y$. If $b_j <_k x$ then we can take $b_j$ as the other endpoint of the extendable path instead of $y$. If $b_j \nless_k x$, then that implies ancestry above $b_j$ was inherited along a different path back to the time of $x$, so we can't use inheritance from $x$ as a criteria for extending. A similar argument holds if $y <_k b_j$ or and $y \nless_k b_j$.

May 15 '24 16:05 petrelharp

I've added two things:

a simple check based on the previous comment for whether a ts is no longer extendable; so if our algorithm terminates before max_iter, then that test should agree; and
an implementation of extend paths that just rebuilds the whole tree sequence after every tree, so is a lot simpler. (I called it "naive", but it's not actually simple enough to be 'naive'; I had to deal with quite a few fiddly things for it.) It's real slow, so can't be used on anything at all big.

Next step: compare it to the non-naive version.

Jun 01 '24 15:06 petrelharp

This is hopefully ready for you to look at, @nspope!

One question is that currently max_iter is required; but, should we provide a default? Should we do a warning if it hasn't terminated (probably)?

Sep 05 '24 00:09 petrelharp

One question is that currently max_iter is required; but, should we provide a default? Should we do a warning if it hasn't terminated (probably)? I think that makes sense. I haven't checked if we each reach the max_iter value for the long sequences we look at, but I feel like it terminates before that value in most cases.

Sep 05 '24 17:09 hfr1tz3

Okay - now it's ready for reals, @nspope. The 'partial coverage' things that codecov flags are cases where we're like

if (A && B) { ... }

but then not all four combinations of A and B ever get seen. This let me identify one place that could be simplified, but I've looked at the others and I don't think any of the others can. They are getting hit by the python tests, and constructing C tests for all of them would be a Big Pain.

Sep 05 '24 22:09 petrelharp

Okay - @jeromekelleher, I think this is ready for a quick look (and then we can merge it! :tada:)

Sep 13 '24 20:09 petrelharp

Wait two more things:

provide some kind of feedback for max_iter
one remaining uncovered code bit

Sep 17 '24 13:09 petrelharp

Okay - now I've legit hit everything we can with tests (and simplified some things along the way). @jeromekelleher - could you just throw a quick eyeball at this?

Sep 17 '24 19:09 petrelharp

Thanks for the pointer pointers! I think this is ready to merge (assuming tests pass). Screenshot from 2024-09-17 22-08-27

Sep 18 '24 05:09 petrelharp

Excellent! I think you just need to cast to (int) for the printfs, give the commits a bit of a squash and it's ready to go in.

Sep 18 '24 09:09 jeromekelleher

Argh, yes - I just missed one cast. Ready now, hopefully!

Sep 18 '24 18:09 petrelharp

tskit
tskit copied to clipboard

extend path algorithm (extend_edgesV2)

Codecov Report

tskit tskit copied to clipboard

extend path algorithm (extend_edgesV2)

Codecov Report

tskit
tskit copied to clipboard