Keep unary for coalescent nodes in simplify
See #2127. See also #1190 for when keep_unary_in_individuals was added.
This is a quick prototype to show how it can be done. Essentially we just need to make a second pass through the overlapping segments to see whether a node participates in any coalescences. The rest is refactoring.
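As a toy illustration of the check (not the actual simplify internals, which operate on the overlapping-segment lists directly), a node is coalescent wherever two or more of the ancestral segments mapped up to it from its children overlap:

```python
def coalesces_anywhere(segments):
    """Return True if two or more of the given (left, right) intervals
    overlap at any point. In simplify terms, `segments` stands in for the
    ancestral segments mapped to a node from its children; the node
    participates in a coalescence iff this returns True."""
    events = []
    for left, right in segments:
        events.append((left, 1))    # segment opens
        events.append((right, -1))  # segment closes
    depth = 0
    # Sorting puts a close before an open at the same position, so
    # segments that merely abut do not count as overlapping.
    for _, delta in sorted(events):
        depth += delta
        if depth >= 2:
            return True
    return False

coalesces_anywhere([(0, 5), (3, 8)])  # overlapping -> True
coalesces_anywhere([(0, 5), (5, 8)])  # abutting -> False
```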
Ideally we would like to use an enumeration like keep_unary="all" | "individuals" | "coalescent", but we need to be a bit careful about how this might interact with people using "truthy" inputs to keep_unary at the moment.
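For instance, a hypothetical normalization helper (the name normalize_keep_unary and the exact accepted values are illustrative, not proposed API) could accept both forms, checking for strings before truthiness, since a non-empty string is itself truthy:

```python
def normalize_keep_unary(keep_unary):
    # Hypothetical normalization: strings name the new behaviours, while
    # truthy/falsey inputs keep their current meaning.
    allowed = {"all", "individuals", "coalescent"}
    if isinstance(keep_unary, str):
        # Must check str first: a plain bool() test would silently map
        # "coalescent" to the legacy truthy behaviour.
        if keep_unary not in allowed:
            raise ValueError(f"keep_unary must be one of {sorted(allowed)}")
        return keep_unary
    return "all" if keep_unary else None

normalize_keep_unary(True)          # -> "all" (legacy truthy input)
normalize_keep_unary("coalescent")  # -> "coalescent"
```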
Here's an example:
When keep_unary = True:
3.86┊ 24 ┊ 24 ┊ ┊
┊ ┏━━━━┻━━━━┓ ┊ ┏━━━━┻━━━━┓ ┊ ┊
2.40┊ 23 ┃ ┊ 23 ┃ ┊ 23 ┊
┊ ┃ ┃ ┊ ┃ ┃ ┊ ┏━━━━┻━━━━┓ ┊
1.32┊ ┃ 21 ┊ ┃ 21 ┊ ┃ 22 ┊
┊ ┃ ┃ ┊ ┃ ┃ ┊ ┃ ┃ ┊
0.48┊ 20 ┃ ┊ 20 ┃ ┊ 20 ┃ ┊
┊ ┏━━━┻━━━┓ ┃ ┊ ┏━━┻━━┓ ┃ ┊ ┏━━┻━━┓ ┃ ┊
0.38┊ ┃ ┃ 19 ┊ ┃ ┃ 19 ┊ ┃ ┃ 19 ┊
┊ ┃ ┃ ┃ ┊ ┃ ┃ ┏┻┓ ┊ ┃ ┃ ┏┻┓ ┊
0.20┊ 18 ┃ ┃ ┊ 18 ┃ ┃ ┃ ┊ 18 ┃ ┃ ┃ ┊
┊ ┏━┻━━┓ ┃ ┃ ┊ ┏━┻━┓ ┃ ┃ ┃ ┊ ┏━┻━┓ ┃ ┃ ┃ ┊
0.20┊ 17 ┃ ┃ ┃ ┊ 17 ┃ ┃ ┃ ┃ ┊ 17 ┃ ┃ ┃ ┃ ┊
┊ ┏┻━┓ ┃ ┃ ┃ ┊ ┏┻━┓ ┃ ┃ ┃ ┃ ┊ ┏┻━┓ ┃ ┃ ┃ ┃ ┊
0.17┊ ┃ ┃ 16 ┃ ┃ ┊ ┃ ┃16 ┃ ┃ ┃ ┊ ┃ ┃16 ┃ ┃ ┃ ┊
┊ ┃ ┃ ┏┻┓ ┃ ┃ ┊ ┃ ┃ ┃ ┃ ┃ ┃ ┊ ┃ ┃ ┃ ┃ ┃ ┃ ┊
0.11┊ 15 ┃ ┃ ┃ ┃ ┃ ┊ 15 ┃ ┃ ┃ ┃ ┃ ┊ 15 ┃ ┃ ┃ ┃ ┃ ┊
┊ ┏┻┓ ┃ ┃ ┃ ┃ ┃ ┊ ┏┻┓ ┃ ┃ ┃ ┃ ┃ ┊ ┏┻┓ ┃ ┃ ┃ ┃ ┃ ┊
0.08┊ ┃ ┃ ┃ ┃ ┃ 14 ┃ ┊ ┃ ┃ ┃ ┃ 14 ┃ ┃ ┊ ┃ ┃ ┃ ┃ 14 ┃ ┃ ┊
┊ ┃ ┃ ┃ ┃ ┃ ┏━┻━┓ ┃ ┊ ┃ ┃ ┃ ┃ ┏━┻━┓ ┃ ┃ ┊ ┃ ┃ ┃ ┃ ┏━┻━┓ ┃ ┃ ┊
0.07┊ ┃ ┃ ┃ ┃ ┃ ┃ 13 ┃ ┊ ┃ ┃ ┃ ┃ ┃ 13 ┃ ┃ ┊ ┃ ┃ ┃ ┃ ┃ 13 ┃ ┃ ┊
┊ ┃ ┃ ┃ ┃ ┃ ┃ ┏┻━┓ ┃ ┊ ┃ ┃ ┃ ┃ ┃ ┏┻━┓ ┃ ┃ ┊ ┃ ┃ ┃ ┃ ┃ ┏┻━┓ ┃ ┃ ┊
0.06┊ ┃ ┃ ┃ ┃ ┃ ┃ 12 ┃ ┃ ┊ ┃ ┃ ┃ ┃ ┃ 12 ┃ ┃ ┃ ┊ ┃ ┃ ┃ ┃ ┃ 12 ┃ ┃ ┃ ┊
┊ ┃ ┃ ┃ ┃ ┃ ┃ ┏┻┓ ┃ ┃ ┊ ┃ ┃ ┃ ┃ ┃ ┏┻┓ ┃ ┃ ┃ ┊ ┃ ┃ ┃ ┃ ┃ ┏┻┓ ┃ ┃ ┃ ┊
0.06┊ ┃ ┃ ┃10 ┃ ┃ ┃ ┃ ┃ ┃ ┊ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃11 ┃ ┊ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃11 ┃ ┊
┊ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┊ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┊ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┊
0.00┊ 0 4 1 2 7 5 6 9 8 3 ┊ 0 4 1 7 5 6 9 8 2 3 ┊ 0 4 1 7 5 6 9 8 2 3 ┊
0 2 4 10
When keep_unary_if_coalescent = True:
3.86┊ 20 ┊ 20 ┊ ┊
┊ ┏━━━━┻━━━━┓ ┊ ┏━━━━┻━━━━┓ ┊ ┊
2.40┊ 19 ┃ ┊ 19 ┃ ┊ 19 ┊
┊ ┃ ┃ ┊ ┃ ┃ ┊ ┏━━━━┻━━━━┓ ┊
0.48┊ 18 ┃ ┊ 18 ┃ ┊ 18 ┃ ┊
┊ ┏━━━┻━━━┓ ┃ ┊ ┏━━┻━━┓ ┃ ┊ ┏━━┻━━┓ ┃ ┊
0.38┊ ┃ ┃ 17 ┊ ┃ ┃ 17 ┊ ┃ ┃ 17 ┊
┊ ┃ ┃ ┃ ┊ ┃ ┃ ┏┻┓ ┊ ┃ ┃ ┏┻┓ ┊
0.20┊ 16 ┃ ┃ ┊ 16 ┃ ┃ ┃ ┊ 16 ┃ ┃ ┃ ┊
┊ ┏━┻━━┓ ┃ ┃ ┊ ┏━┻━┓ ┃ ┃ ┃ ┊ ┏━┻━┓ ┃ ┃ ┃ ┊
0.20┊ 15 ┃ ┃ ┃ ┊ 15 ┃ ┃ ┃ ┃ ┊ 15 ┃ ┃ ┃ ┃ ┊
┊ ┏┻━┓ ┃ ┃ ┃ ┊ ┏┻━┓ ┃ ┃ ┃ ┃ ┊ ┏┻━┓ ┃ ┃ ┃ ┃ ┊
0.17┊ ┃ ┃ 14 ┃ ┃ ┊ ┃ ┃14 ┃ ┃ ┃ ┊ ┃ ┃14 ┃ ┃ ┃ ┊
┊ ┃ ┃ ┏┻┓ ┃ ┃ ┊ ┃ ┃ ┃ ┃ ┃ ┃ ┊ ┃ ┃ ┃ ┃ ┃ ┃ ┊
0.11┊ 13 ┃ ┃ ┃ ┃ ┃ ┊ 13 ┃ ┃ ┃ ┃ ┃ ┊ 13 ┃ ┃ ┃ ┃ ┃ ┊
┊ ┏┻┓ ┃ ┃ ┃ ┃ ┃ ┊ ┏┻┓ ┃ ┃ ┃ ┃ ┃ ┊ ┏┻┓ ┃ ┃ ┃ ┃ ┃ ┊
0.08┊ ┃ ┃ ┃ ┃ ┃ 12 ┃ ┊ ┃ ┃ ┃ ┃ 12 ┃ ┃ ┊ ┃ ┃ ┃ ┃ 12 ┃ ┃ ┊
┊ ┃ ┃ ┃ ┃ ┃ ┏━┻━┓ ┃ ┊ ┃ ┃ ┃ ┃ ┏━┻━┓ ┃ ┃ ┊ ┃ ┃ ┃ ┃ ┏━┻━┓ ┃ ┃ ┊
0.07┊ ┃ ┃ ┃ ┃ ┃ ┃ 11 ┃ ┊ ┃ ┃ ┃ ┃ ┃ 11 ┃ ┃ ┊ ┃ ┃ ┃ ┃ ┃ 11 ┃ ┃ ┊
┊ ┃ ┃ ┃ ┃ ┃ ┃ ┏┻━┓ ┃ ┊ ┃ ┃ ┃ ┃ ┃ ┏┻━┓ ┃ ┃ ┊ ┃ ┃ ┃ ┃ ┃ ┏┻━┓ ┃ ┃ ┊
0.06┊ ┃ ┃ ┃ ┃ ┃ ┃ 10 ┃ ┃ ┊ ┃ ┃ ┃ ┃ ┃ 10 ┃ ┃ ┃ ┊ ┃ ┃ ┃ ┃ ┃ 10 ┃ ┃ ┃ ┊
┊ ┃ ┃ ┃ ┃ ┃ ┃ ┏┻┓ ┃ ┃ ┊ ┃ ┃ ┃ ┃ ┃ ┏┻┓ ┃ ┃ ┃ ┊ ┃ ┃ ┃ ┃ ┃ ┏┻┓ ┃ ┃ ┃ ┊
0.00┊ 0 4 1 2 7 5 6 9 8 3 ┊ 0 4 1 7 5 6 9 8 2 3 ┊ 0 4 1 7 5 6 9 8 2 3 ┊
0 2 4 10
When keep_unary = False:
3.86┊ 20 ┊ 20 ┊ ┊
┊ ┏━━━━┻━━━━┓ ┊ ┏━━━━┻━━━━┓ ┊ ┊
2.40┊ ┃ ┃ ┊ ┃ ┃ ┊ 19 ┊
┊ ┃ ┃ ┊ ┃ ┃ ┊ ┏━━━━┻━━━━┓ ┊
0.48┊ 18 ┃ ┊ 18 ┃ ┊ 18 ┃ ┊
┊ ┏━━━┻━━━┓ ┃ ┊ ┏━━┻━━┓ ┃ ┊ ┏━━┻━━┓ ┃ ┊
0.38┊ ┃ ┃ ┃ ┊ ┃ ┃ 17 ┊ ┃ ┃ 17 ┊
┊ ┃ ┃ ┃ ┊ ┃ ┃ ┏┻┓ ┊ ┃ ┃ ┏┻┓ ┊
0.20┊ 16 ┃ ┃ ┊ 16 ┃ ┃ ┃ ┊ 16 ┃ ┃ ┃ ┊
┊ ┏━┻━━┓ ┃ ┃ ┊ ┏━┻━┓ ┃ ┃ ┃ ┊ ┏━┻━┓ ┃ ┃ ┃ ┊
0.20┊ 15 ┃ ┃ ┃ ┊ 15 ┃ ┃ ┃ ┃ ┊ 15 ┃ ┃ ┃ ┃ ┊
┊ ┏┻━┓ ┃ ┃ ┃ ┊ ┏┻━┓ ┃ ┃ ┃ ┃ ┊ ┏┻━┓ ┃ ┃ ┃ ┃ ┊
0.17┊ ┃ ┃ 14 ┃ ┃ ┊ ┃ ┃ ┃ ┃ ┃ ┃ ┊ ┃ ┃ ┃ ┃ ┃ ┃ ┊
┊ ┃ ┃ ┏┻┓ ┃ ┃ ┊ ┃ ┃ ┃ ┃ ┃ ┃ ┊ ┃ ┃ ┃ ┃ ┃ ┃ ┊
0.11┊ 13 ┃ ┃ ┃ ┃ ┃ ┊ 13 ┃ ┃ ┃ ┃ ┃ ┊ 13 ┃ ┃ ┃ ┃ ┃ ┊
┊ ┏┻┓ ┃ ┃ ┃ ┃ ┃ ┊ ┏┻┓ ┃ ┃ ┃ ┃ ┃ ┊ ┏┻┓ ┃ ┃ ┃ ┃ ┃ ┊
0.08┊ ┃ ┃ ┃ ┃ ┃ 12 ┃ ┊ ┃ ┃ ┃ ┃ 12 ┃ ┃ ┊ ┃ ┃ ┃ ┃ 12 ┃ ┃ ┊
┊ ┃ ┃ ┃ ┃ ┃ ┏━┻━┓ ┃ ┊ ┃ ┃ ┃ ┃ ┏━┻━┓ ┃ ┃ ┊ ┃ ┃ ┃ ┃ ┏━┻━┓ ┃ ┃ ┊
0.07┊ ┃ ┃ ┃ ┃ ┃ ┃ 11 ┃ ┊ ┃ ┃ ┃ ┃ ┃ 11 ┃ ┃ ┊ ┃ ┃ ┃ ┃ ┃ 11 ┃ ┃ ┊
┊ ┃ ┃ ┃ ┃ ┃ ┃ ┏┻━┓ ┃ ┊ ┃ ┃ ┃ ┃ ┃ ┏┻━┓ ┃ ┃ ┊ ┃ ┃ ┃ ┃ ┃ ┏┻━┓ ┃ ┃ ┊
0.06┊ ┃ ┃ ┃ ┃ ┃ ┃ 10 ┃ ┃ ┊ ┃ ┃ ┃ ┃ ┃ 10 ┃ ┃ ┃ ┊ ┃ ┃ ┃ ┃ ┃ 10 ┃ ┃ ┃ ┊
┊ ┃ ┃ ┃ ┃ ┃ ┃ ┏┻┓ ┃ ┃ ┊ ┃ ┃ ┃ ┃ ┃ ┏┻┓ ┃ ┃ ┃ ┊ ┃ ┃ ┃ ┃ ┃ ┏┻┓ ┃ ┃ ┃ ┊
0.00┊ 0 4 1 2 7 5 6 9 8 3 ┊ 0 4 1 7 5 6 9 8 2 3 ┊ 0 4 1 7 5 6 9 8 2 3 ┊
0 2 4 10
Codecov Report
Merging #2381 (daae563) into main (06e3fb0) will increase coverage by 0.00%. The diff coverage is 96.29%.
@@           Coverage Diff           @@
##             main    #2381   +/-  ##
=======================================
  Coverage   93.33%   93.34%
=======================================
  Files          28       28
  Lines       26981    26997    +16
  Branches     1246     1247     +1
=======================================
+ Hits        25184    25201    +17
+ Misses       1763     1762     -1
  Partials       34       34
| Flag | Coverage Δ | |
|---|---|---|
| c-tests | 92.26% <95.00%> (-0.01%) | :arrow_down: |
| lwt-tests | 89.05% <ø> (ø) | |
| python-c-tests | 71.00% <85.71%> (+0.01%) | :arrow_up: |
| python-tests | 98.93% <100.00%> (+<0.01%) | :arrow_up: |
Flags with carried forward coverage won't be shown.
| Impacted Files | Coverage Δ | |
|---|---|---|
| python/tskit/trees.py | 98.66% <ø> (ø) | |
| c/tskit/tables.c | 90.25% <95.00%> (+<0.01%) | :arrow_up: |
| python/_tskitmodule.c | 90.70% <100.00%> (+0.03%) | :arrow_up: |
| python/tskit/tables.py | 98.96% <100.00%> (+<0.01%) | :arrow_up: |
If I've got it right, this is retaining all ancestry for any node that is a coalescent node anywhere in the sequence. This is slightly different to what extend_edges does, which is to output any input edges on which there is a coalescence somewhere. So, we could have a node that is a coalescent node on [0, 10], and then we output an edge for which it's a parent on [0, 15], but not a different edge for which it's a parent on [50, 60]. The argument for doing the second thing but not the first is that (a) it's not clear we have any information to infer the existence of these totally-unary edges; and (b) these totally-unary edges only increase the size of the tree sequence (as opposed to the partially-coalescent ones, which reduce it).
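The distinction can be made concrete on a toy example. The helper below is purely illustrative (kept_edges is not a tskit function): the per-node rule keeps every edge of a node that is coalescent anywhere, while the per-edge (extend_edges-style) rule keeps only edges that overlap some coalescent region, dropping the totally-unary [50, 60] edge from the example above:

```python
def kept_edges(edges, coalescent_intervals, policy):
    # edges: (left, right) intervals on which one node is a parent.
    # coalescent_intervals: regions on which that node is coalescent.
    def overlaps(a, b):
        return a[0] < b[1] and b[0] < a[1]

    if policy == "node":
        # Keep all of the node's edges if it coalesces anywhere.
        return list(edges) if coalescent_intervals else []
    elif policy == "edge":
        # Keep only edges that themselves contain a coalescence.
        return [e for e in edges
                if any(overlaps(e, c) for c in coalescent_intervals)]
    raise ValueError(f"unknown policy {policy!r}")

edges = [(0, 15), (50, 60)]
coal = [(0, 10)]
kept_edges(edges, coal, "node")  # both edges kept
kept_edges(edges, coal, "edge")  # only (0, 15) kept
```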
Thoughts?
Just to get this right in my head, you mean that if the node has non-contiguous edges (i.e. "trapped material") where it exists in the topology from (say) 10-15 and 50-60, but is not on any lineage, even as a unary node, from 15-50, then the current implementation keeps the 50-60 edge too?
If so, I think that's right, and the implementation here is not unreasonable behaviour. For example, you may want to keep the number of nodes unaffected, but preserve as much of the known ancestry as possible. Deleting these regions loses potentially useful ancestral information. It would be interesting to see whether nodes like this are inferred by tsinfer. I'll check now, but I suspect there are some cases where they will be. I can see the rationale for sometimes wanting to delete them too, though.
I suspect that you are right that they only increase the size of the tree sequence, but I'm not 100% convinced that this is always the case.
Following up on my comment above: we do have "split" edges like this in tsinfer (we also have quite a lot of nodes that are unary everywhere too). Here's an example; it may be worth putting something like this in the repo at https://github.com/petrelharp/num_edges
import collections
import numpy as np
import msprime
import tskit
import tsinfer

num_samples = 8
ts = msprime.sim_ancestry(
    num_samples,
    ploidy=1,
    sequence_length=10e8,
    recombination_rate=1e-8,
    record_full_arg=True,
    random_seed=222,
)
tsA = msprime.sim_mutations(ts, rate=1e-8, random_seed=123)
print(tsA.num_sites, "sites")
tsB = tsinfer.infer(tsinfer.SampleData.from_tree_sequence(tsA))
print(tsB.num_trees, "inferred trees")

def node_spans_max_children(ts):
    # node_id => [[left, right, max_n_children], ...]: one entry per
    # contiguous region over which the node is present in the topology,
    # recording the maximum number of children it has on that region
    node_spans = collections.defaultdict(list)
    curr_parents = collections.defaultdict(set)
    for tree, diffs in zip(ts.trees(), ts.edge_diffs()):
        for e_in in diffs.edges_in:
            u = e_in.parent
            if len(curr_parents[u]) == 0:
                # node starts
                node_spans[u].append(
                    [diffs.interval.left, diffs.interval.right, tree.num_children(u)]
                )
            else:
                node_spans[u][-1][2] = max(node_spans[u][-1][2], tree.num_children(u))
            curr_parents[e_in.parent].add(e_in.id)
        for e_out in diffs.edges_out:
            u = e_out.parent
            curr_parents[u].remove(e_out.id)
            if len(curr_parents[u]) == 0:
                # node ends
                node_spans[u][-1][1] = diffs.interval.right
    for u, data in node_spans.items():
        max_children = max(contiguous[2] for contiguous in data)
        for contiguous in data:
            if max_children > 1 and contiguous[2] < 2:
                print("Node", u, "is contiguously unary from", contiguous)

node_spans_max_children(tsB)
43 sites
16 inferred trees
Node 16 is contiguously unary from [548658128.0, 993319578.0, 1]
tsB.draw_svg(time_scale="rank", style=".n16 > .sym {fill: red} .mut {display: none}")
Ah, great, nice example. I do agree that sometimes it'd be conceptually nice to have those nodes in their entirety, but the two operations are in principle different, and we should figure out whether we want one or both of them. If the unconnected, unary-only segments of a coalescent node are also inferrable, then we should also include them! But I'm not sure that they are? Do you have a good reason that they might be?
So, one question is: are these unary-only spans real or artifactual? Like, when tsinfer infers a unary-only bit, how often is that unary-only bit "correct"? In the case you've got here, the extra unary bit does not seem to be correct, at least it isn't a parent to the same samples. On my machine (tsinfer 0.2.3) the node in question is 14, not 16, but:
t = tsA.first()
orig_node = t.mrca(1, 3, 7)
t = tsB.first()
new_node = t.mrca(1, 3, 7)
assert new_node == 14
left = 0
sA = set()
sB = set()
for interval, tA, tB in tsA.coiterate(tsB):
    new_sA = set(tA.samples(orig_node))
    new_sB = set(tB.samples(new_node))
    if sA != new_sA or sB != new_sB:
        print(f"{left} to {interval[0]} the node is a parent to:")
        print(f"    truth: {sA}")
        print(f"    inferred: {sB}")
    left = interval[0]
    sA = new_sA
    sB = new_sB
gives
0 to 0 the node is a parent to:
    truth: set()
    inferred: set()
147684860.0 to 160570091.0 the node is a parent to:
    truth: {1, 3, 7}
    inferred: {1, 3, 7}
169968287.0 to 173906518.0 the node is a parent to:
    truth: {3, 7}
    inferred: {1, 3, 7}
173906518.0 to 178484736.0 the node is a parent to:
    truth: {3, 7}
    inferred: {1, 3, 5, 6, 7}
178484736.0 to 204789241.0 the node is a parent to:
    truth: {3, 7}
    inferred: {3, 5, 7}
269704499.0 to 320784505.0 the node is a parent to:
    truth: {3, 7}
    inferred: set()
480931907.0 to 488755800.0 the node is a parent to:
    truth: {3, 5, 7}
    inferred: set()
540120992.0 to 548658128.0 the node is a parent to:
    truth: {0, 3, 5, 7}
    inferred: set()
564756120.0 to 576356796.0 the node is a parent to:
    truth: {0, 3, 5, 7}
    inferred: {3}
576356796.0 to 577972638.0 the node is a parent to:
    truth: {0, 5, 7}
    inferred: {3}
861036781.0 to 884623975.0 the node is a parent to:
    truth: {0, 5, 7}
    inferred: {3, 6}
959461190.0 to 983144527.0 the node is a parent to:
    truth: {0, 5, 7}
    inferred: set()
Have you got any situations where there's clearly information to infer those unary-only bits?
Moving this to https://github.com/petrelharp/num_edges/issues/2 so the discussion can continue easily even when the PR is closed.
I think what simplify is doing here is a reasonable operation, because it's not discarding any information. Given a full ARG, it keeps a full record of any topology for ancestors that are coalescent anywhere. This is a useful operation, and I think corresponds to what msprime may optionally output some day.
Implementing the exact mirror of edge_extend in simplify would be tricky, I think, and much more complicated than this, because you'd have to do a lot of reasoning about what's adjacent to what, and exclude edges depending on what they are adjacent to. I'm not enthusiastic about trying to do this - partly because it would be hard, but mostly because I don't see a lot of point (other than to provide the mirror-image of edge_extend). In practice the increase in size caused by extra unary-only edges will be negligible, I think, and not worth the substantially more complicated simplify algorithm (from, say, the forward simulation perspective).
I guess we could do some experiments easily enough to see how significant the difference is by counting the number of unary-only edges output by this for a range of input parameters?
I agree 100% with Jerome here. The operation as currently defined is simple and easy to implement and explain. The one involving adjacency, although theoretically interesting, is much more complicated as a simplification method IMO.
As an example of an additional complication due to adjacency: what happens when all the current edges vanish and are replaced with a single different edge with the same parent? This would appear to be adjacent, but I'm not sure whether it would be extended using @petrelharp's algorithm. The simplify method here can ignore all that.
Sounds good - let's do some experiments and see what the difference is. I do worry that the result will be harder to explain, though: to the question "why's that stuff in the tree sequence?" it's less confusing to say "it's only what you can learn from the trees" than if we add "plus a bit more stuff that made the algorithm easier".
I've always seen edge_extend as being a method to infer a subset of the true ancestry (i.e., what's returned by this method), which is why I said it would be exact in the SMC case and an approximation for things like full Hudson and gene conversion. I would then put it the other way around - we can infer a subset (~the vast majority?) of the true unary nodes using these properties you're exploiting.
I think it's a pretty strong assertion to say that the current edge_extend algorithm is the best possible and no extra information about unary nodes could ever be derived from the trees.
This LGTM @jeromekelleher: it's nice that it's relatively simple. I agree that a separate parameter is the way to go, in case people want to keep both unary-in-individuals and unary-when-coalescent (I don't think we test for combinations of the keep_unary_XXX params, do we?)
I assume that this would still remove a node which is unary in some places, and coalescent in others, but where the coalescent regions do not lead to more than one descendant sample (because one of the lineages down from a child is a piece of "hanging" topology that does not lead to a sample). Should we test this too?
The only other comment I would have is to think about the name of the parameter. To someone who hasn't thought about it a lot, keep unary if coalescent could possibly seem like a contradiction in terms. It's more like keep_unary_if_coalescent_elsewhere, but that's too long. We could have keep_partially_unary, although it's nice to have all the params starting with keep_unary_XXXXX, which I presume is why we hit upon the current naming. It's probably fine as it is, if documented well.
The only other comment I would have is to think about the name of the parameter.
Another possible name: keep_unary_regions (to emphasise that it's not that we are keeping unary nodes per se, but the unary regions of nodes that we would have kept anyway).
Here's a possible docstring. It's a bit contorted, but the best I could do after a fair bit of rewording.
If True, keep unary regions of retained nodes. After simplification, unary nodes may continue to exist in local trees, but only if those nodes would have been retained by the simplification process anyway (e.g. because they represent coalescences between two or more lineages from sample nodes to the root). If False, these nodes will still be present in the tree sequence, but are subject to removal from lineages where they are not coalescent points. Default: False
I think it's a pretty strong assertion to say that the current edge_extend algorithm is the best possible and no extra information about unary nodes could ever be derived from the trees.
Totally agree - I wasn't trying to say that it was!
How about
If True, keep all regions over which retained nodes are ancestral to samples. By default, nodes not in samples are only retained on regions of the genome where they are coalescent (i.e., an MRCA of at least two nodes in samples); if this option is True then nodes not in samples are still only retained if they are an MRCA of at least two nodes in samples, but all regions of the genome on which they are ancestral to a node in samples are also retained. Any extra retained information will show up as unary nodes on local trees. Default: False
OK - I did some quick experiments. This keep_unary argument seems to usually increase the number of edges by about as much as extend_edges decreases it. It's very interesting how much of this stuff there is!

So - this operation might still be useful, but I was wanting to include it in SLiM, to produce smaller, faster files. It looks like this operation won't do that. I'm not sure about the bigger picture at the moment.
Huh, isn't that interesting... I wonder if that's still true if you have a larger sample size? 6 is small enough that I can imagine odd things happening.
I thought I had some code and plots that showed that the hacked version generally reduced the number of edges. It could have been with a bigger sample size: I can't remember.
Hah, you're right about the number of samples:

... why would this happen? (i.e., why would more samples make the proportion of edges that are isolated unary edges on coalescent nodes smaller?)
... why would this happen? (i.e., why would more samples make the proportion of edges that are isolated unary edges on coalescent nodes smaller?)
Not obvious to me...
I wonder how this will change with a longer genome, though...
Ack, here are results for a longer sequence (0.1M instead of 0.01M):

Here's a bit more of an experiment changing both the number of samples and sequence length; recomb rate is 1e-8. Left is 10 samples; center is 100 samples; right is 1000.

Very interesting! I really didn't see this coming - the plot makes a clear practical case for edge_extend.
I guess we do need to implement it in simplify to be truly useful for forward sims. I think the present operation is useful also, but we'd need some new names I think.
Maybe we should move away from the phrase keep_unary entirely? I'm finding it less helpful every time I look at this stuff.