Keep unary for coalescent nodes in simplify
See #2127. See also #1190 for when keep_unary_in_individuals was added.
This is a quick prototype to show how it can be done. Essentially we just need to make a second pass through the overlapping segments to see whether a node participates in any coalescences. The rest is refactoring.
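As a toy illustration of the check (not the actual simplify internals, which operate on the overlapping-segment lists directly), a node is coalescent wherever two or more of the ancestral segments mapped up to it from its children overlap:

```python
def coalesces_anywhere(segments):
    """Return True if two or more of the given (left, right) intervals
    overlap at any point. In simplify terms, `segments` stands in for the
    ancestral segments mapped to a node from its children; the node
    participates in a coalescence iff this returns True."""
    events = []
    for left, right in segments:
        events.append((left, 1))    # segment opens
        events.append((right, -1))  # segment closes
    depth = 0
    # Sorting puts a close before an open at the same position, so
    # segments that merely abut do not count as overlapping.
    for _, delta in sorted(events):
        depth += delta
        if depth >= 2:
            return True
    return False

coalesces_anywhere([(0, 5), (3, 8)])  # overlapping -> True
coalesces_anywhere([(0, 5), (5, 8)])  # abutting -> False
```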
Ideally we would like to use an enumeration like keep_unary="all" | "individuals" | "coalescent", but we need to be a bit careful about how this might interact with people using "truthy" inputs to keep_unary at the moment.
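For instance, a hypothetical normalization helper (the name normalize_keep_unary and the exact accepted values are illustrative, not proposed API) could accept both forms, checking for strings before truthiness, since a non-empty string is itself truthy:

```python
def normalize_keep_unary(keep_unary):
    # Hypothetical normalization: strings name the new behaviours, while
    # truthy/falsey inputs keep their current meaning.
    allowed = {"all", "individuals", "coalescent"}
    if isinstance(keep_unary, str):
        # Must check str first: a plain bool() test would silently map
        # "coalescent" to the legacy truthy behaviour.
        if keep_unary not in allowed:
            raise ValueError(f"keep_unary must be one of {sorted(allowed)}")
        return keep_unary
    return "all" if keep_unary else None

normalize_keep_unary(True)          # -> "all" (legacy truthy input)
normalize_keep_unary("coalescent")  # -> "coalescent"
```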
Here's an example:
When keep_unary = True:
3.86┊ 24 ┊ 24 ┊ ┊
┊ ┏━━━━┻━━━━┓ ┊ ┏━━━━┻━━━━┓ ┊ ┊
2.40┊ 23 ┃ ┊ 23 ┃ ┊ 23 ┊
┊ ┃ ┃ ┊ ┃ ┃ ┊ ┏━━━━┻━━━━┓ ┊
1.32┊ ┃ 21 ┊ ┃ 21 ┊ ┃ 22 ┊
┊ ┃ ┃ ┊ ┃ ┃ ┊ ┃ ┃ ┊
0.48┊ 20 ┃ ┊ 20 ┃ ┊ 20 ┃ ┊
┊ ┏━━━┻━━━┓ ┃ ┊ ┏━━┻━━┓ ┃ ┊ ┏━━┻━━┓ ┃ ┊
0.38┊ ┃ ┃ 19 ┊ ┃ ┃ 19 ┊ ┃ ┃ 19 ┊
┊ ┃ ┃ ┃ ┊ ┃ ┃ ┏┻┓ ┊ ┃ ┃ ┏┻┓ ┊
0.20┊ 18 ┃ ┃ ┊ 18 ┃ ┃ ┃ ┊ 18 ┃ ┃ ┃ ┊
┊ ┏━┻━━┓ ┃ ┃ ┊ ┏━┻━┓ ┃ ┃ ┃ ┊ ┏━┻━┓ ┃ ┃ ┃ ┊
0.20┊ 17 ┃ ┃ ┃ ┊ 17 ┃ ┃ ┃ ┃ ┊ 17 ┃ ┃ ┃ ┃ ┊
┊ ┏┻━┓ ┃ ┃ ┃ ┊ ┏┻━┓ ┃ ┃ ┃ ┃ ┊ ┏┻━┓ ┃ ┃ ┃ ┃ ┊
0.17┊ ┃ ┃ 16 ┃ ┃ ┊ ┃ ┃16 ┃ ┃ ┃ ┊ ┃ ┃16 ┃ ┃ ┃ ┊
┊ ┃ ┃ ┏┻┓ ┃ ┃ ┊ ┃ ┃ ┃ ┃ ┃ ┃ ┊ ┃ ┃ ┃ ┃ ┃ ┃ ┊
0.11┊ 15 ┃ ┃ ┃ ┃ ┃ ┊ 15 ┃ ┃ ┃ ┃ ┃ ┊ 15 ┃ ┃ ┃ ┃ ┃ ┊
┊ ┏┻┓ ┃ ┃ ┃ ┃ ┃ ┊ ┏┻┓ ┃ ┃ ┃ ┃ ┃ ┊ ┏┻┓ ┃ ┃ ┃ ┃ ┃ ┊
0.08┊ ┃ ┃ ┃ ┃ ┃ 14 ┃ ┊ ┃ ┃ ┃ ┃ 14 ┃ ┃ ┊ ┃ ┃ ┃ ┃ 14 ┃ ┃ ┊
┊ ┃ ┃ ┃ ┃ ┃ ┏━┻━┓ ┃ ┊ ┃ ┃ ┃ ┃ ┏━┻━┓ ┃ ┃ ┊ ┃ ┃ ┃ ┃ ┏━┻━┓ ┃ ┃ ┊
0.07┊ ┃ ┃ ┃ ┃ ┃ ┃ 13 ┃ ┊ ┃ ┃ ┃ ┃ ┃ 13 ┃ ┃ ┊ ┃ ┃ ┃ ┃ ┃ 13 ┃ ┃ ┊
┊ ┃ ┃ ┃ ┃ ┃ ┃ ┏┻━┓ ┃ ┊ ┃ ┃ ┃ ┃ ┃ ┏┻━┓ ┃ ┃ ┊ ┃ ┃ ┃ ┃ ┃ ┏┻━┓ ┃ ┃ ┊
0.06┊ ┃ ┃ ┃ ┃ ┃ ┃ 12 ┃ ┃ ┊ ┃ ┃ ┃ ┃ ┃ 12 ┃ ┃ ┃ ┊ ┃ ┃ ┃ ┃ ┃ 12 ┃ ┃ ┃ ┊
┊ ┃ ┃ ┃ ┃ ┃ ┃ ┏┻┓ ┃ ┃ ┊ ┃ ┃ ┃ ┃ ┃ ┏┻┓ ┃ ┃ ┃ ┊ ┃ ┃ ┃ ┃ ┃ ┏┻┓ ┃ ┃ ┃ ┊
0.06┊ ┃ ┃ ┃10 ┃ ┃ ┃ ┃ ┃ ┃ ┊ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃11 ┃ ┊ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃11 ┃ ┊
┊ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┊ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┊ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┊
0.00┊ 0 4 1 2 7 5 6 9 8 3 ┊ 0 4 1 7 5 6 9 8 2 3 ┊ 0 4 1 7 5 6 9 8 2 3 ┊
0 2 4 10
When keep_unary_if_coalescent = True:
3.86┊ 20 ┊ 20 ┊ ┊
┊ ┏━━━━┻━━━━┓ ┊ ┏━━━━┻━━━━┓ ┊ ┊
2.40┊ 19 ┃ ┊ 19 ┃ ┊ 19 ┊
┊ ┃ ┃ ┊ ┃ ┃ ┊ ┏━━━━┻━━━━┓ ┊
0.48┊ 18 ┃ ┊ 18 ┃ ┊ 18 ┃ ┊
┊ ┏━━━┻━━━┓ ┃ ┊ ┏━━┻━━┓ ┃ ┊ ┏━━┻━━┓ ┃ ┊
0.38┊ ┃ ┃ 17 ┊ ┃ ┃ 17 ┊ ┃ ┃ 17 ┊
┊ ┃ ┃ ┃ ┊ ┃ ┃ ┏┻┓ ┊ ┃ ┃ ┏┻┓ ┊
0.20┊ 16 ┃ ┃ ┊ 16 ┃ ┃ ┃ ┊ 16 ┃ ┃ ┃ ┊
┊ ┏━┻━━┓ ┃ ┃ ┊ ┏━┻━┓ ┃ ┃ ┃ ┊ ┏━┻━┓ ┃ ┃ ┃ ┊
0.20┊ 15 ┃ ┃ ┃ ┊ 15 ┃ ┃ ┃ ┃ ┊ 15 ┃ ┃ ┃ ┃ ┊
┊ ┏┻━┓ ┃ ┃ ┃ ┊ ┏┻━┓ ┃ ┃ ┃ ┃ ┊ ┏┻━┓ ┃ ┃ ┃ ┃ ┊
0.17┊ ┃ ┃ 14 ┃ ┃ ┊ ┃ ┃14 ┃ ┃ ┃ ┊ ┃ ┃14 ┃ ┃ ┃ ┊
┊ ┃ ┃ ┏┻┓ ┃ ┃ ┊ ┃ ┃ ┃ ┃ ┃ ┃ ┊ ┃ ┃ ┃ ┃ ┃ ┃ ┊
0.11┊ 13 ┃ ┃ ┃ ┃ ┃ ┊ 13 ┃ ┃ ┃ ┃ ┃ ┊ 13 ┃ ┃ ┃ ┃ ┃ ┊
┊ ┏┻┓ ┃ ┃ ┃ ┃ ┃ ┊ ┏┻┓ ┃ ┃ ┃ ┃ ┃ ┊ ┏┻┓ ┃ ┃ ┃ ┃ ┃ ┊
0.08┊ ┃ ┃ ┃ ┃ ┃ 12 ┃ ┊ ┃ ┃ ┃ ┃ 12 ┃ ┃ ┊ ┃ ┃ ┃ ┃ 12 ┃ ┃ ┊
┊ ┃ ┃ ┃ ┃ ┃ ┏━┻━┓ ┃ ┊ ┃ ┃ ┃ ┃ ┏━┻━┓ ┃ ┃ ┊ ┃ ┃ ┃ ┃ ┏━┻━┓ ┃ ┃ ┊
0.07┊ ┃ ┃ ┃ ┃ ┃ ┃ 11 ┃ ┊ ┃ ┃ ┃ ┃ ┃ 11 ┃ ┃ ┊ ┃ ┃ ┃ ┃ ┃ 11 ┃ ┃ ┊
┊ ┃ ┃ ┃ ┃ ┃ ┃ ┏┻━┓ ┃ ┊ ┃ ┃ ┃ ┃ ┃ ┏┻━┓ ┃ ┃ ┊ ┃ ┃ ┃ ┃ ┃ ┏┻━┓ ┃ ┃ ┊
0.06┊ ┃ ┃ ┃ ┃ ┃ ┃ 10 ┃ ┃ ┊ ┃ ┃ ┃ ┃ ┃ 10 ┃ ┃ ┃ ┊ ┃ ┃ ┃ ┃ ┃ 10 ┃ ┃ ┃ ┊
┊ ┃ ┃ ┃ ┃ ┃ ┃ ┏┻┓ ┃ ┃ ┊ ┃ ┃ ┃ ┃ ┃ ┏┻┓ ┃ ┃ ┃ ┊ ┃ ┃ ┃ ┃ ┃ ┏┻┓ ┃ ┃ ┃ ┊
0.00┊ 0 4 1 2 7 5 6 9 8 3 ┊ 0 4 1 7 5 6 9 8 2 3 ┊ 0 4 1 7 5 6 9 8 2 3 ┊
0 2 4 10
When keep_unary = False:
3.86┊ 20 ┊ 20 ┊ ┊
┊ ┏━━━━┻━━━━┓ ┊ ┏━━━━┻━━━━┓ ┊ ┊
2.40┊ ┃ ┃ ┊ ┃ ┃ ┊ 19 ┊
┊ ┃ ┃ ┊ ┃ ┃ ┊ ┏━━━━┻━━━━┓ ┊
0.48┊ 18 ┃ ┊ 18 ┃ ┊ 18 ┃ ┊
┊ ┏━━━┻━━━┓ ┃ ┊ ┏━━┻━━┓ ┃ ┊ ┏━━┻━━┓ ┃ ┊
0.38┊ ┃ ┃ ┃ ┊ ┃ ┃ 17 ┊ ┃ ┃ 17 ┊
┊ ┃ ┃ ┃ ┊ ┃ ┃ ┏┻┓ ┊ ┃ ┃ ┏┻┓ ┊
0.20┊ 16 ┃ ┃ ┊ 16 ┃ ┃ ┃ ┊ 16 ┃ ┃ ┃ ┊
┊ ┏━┻━━┓ ┃ ┃ ┊ ┏━┻━┓ ┃ ┃ ┃ ┊ ┏━┻━┓ ┃ ┃ ┃ ┊
0.20┊ 15 ┃ ┃ ┃ ┊ 15 ┃ ┃ ┃ ┃ ┊ 15 ┃ ┃ ┃ ┃ ┊
┊ ┏┻━┓ ┃ ┃ ┃ ┊ ┏┻━┓ ┃ ┃ ┃ ┃ ┊ ┏┻━┓ ┃ ┃ ┃ ┃ ┊
0.17┊ ┃ ┃ 14 ┃ ┃ ┊ ┃ ┃ ┃ ┃ ┃ ┃ ┊ ┃ ┃ ┃ ┃ ┃ ┃ ┊
┊ ┃ ┃ ┏┻┓ ┃ ┃ ┊ ┃ ┃ ┃ ┃ ┃ ┃ ┊ ┃ ┃ ┃ ┃ ┃ ┃ ┊
0.11┊ 13 ┃ ┃ ┃ ┃ ┃ ┊ 13 ┃ ┃ ┃ ┃ ┃ ┊ 13 ┃ ┃ ┃ ┃ ┃ ┊
┊ ┏┻┓ ┃ ┃ ┃ ┃ ┃ ┊ ┏┻┓ ┃ ┃ ┃ ┃ ┃ ┊ ┏┻┓ ┃ ┃ ┃ ┃ ┃ ┊
0.08┊ ┃ ┃ ┃ ┃ ┃ 12 ┃ ┊ ┃ ┃ ┃ ┃ 12 ┃ ┃ ┊ ┃ ┃ ┃ ┃ 12 ┃ ┃ ┊
┊ ┃ ┃ ┃ ┃ ┃ ┏━┻━┓ ┃ ┊ ┃ ┃ ┃ ┃ ┏━┻━┓ ┃ ┃ ┊ ┃ ┃ ┃ ┃ ┏━┻━┓ ┃ ┃ ┊
0.07┊ ┃ ┃ ┃ ┃ ┃ ┃ 11 ┃ ┊ ┃ ┃ ┃ ┃ ┃ 11 ┃ ┃ ┊ ┃ ┃ ┃ ┃ ┃ 11 ┃ ┃ ┊
┊ ┃ ┃ ┃ ┃ ┃ ┃ ┏┻━┓ ┃ ┊ ┃ ┃ ┃ ┃ ┃ ┏┻━┓ ┃ ┃ ┊ ┃ ┃ ┃ ┃ ┃ ┏┻━┓ ┃ ┃ ┊
0.06┊ ┃ ┃ ┃ ┃ ┃ ┃ 10 ┃ ┃ ┊ ┃ ┃ ┃ ┃ ┃ 10 ┃ ┃ ┃ ┊ ┃ ┃ ┃ ┃ ┃ 10 ┃ ┃ ┃ ┊
┊ ┃ ┃ ┃ ┃ ┃ ┃ ┏┻┓ ┃ ┃ ┊ ┃ ┃ ┃ ┃ ┃ ┏┻┓ ┃ ┃ ┃ ┊ ┃ ┃ ┃ ┃ ┃ ┏┻┓ ┃ ┃ ┃ ┊
0.00┊ 0 4 1 2 7 5 6 9 8 3 ┊ 0 4 1 7 5 6 9 8 2 3 ┊ 0 4 1 7 5 6 9 8 2 3 ┊
0 2 4 10
Codecov Report
Merging #2381 (daae563) into main (06e3fb0) will increase coverage by 0.00%. The diff coverage is 96.29%.
@@           Coverage Diff           @@
##             main    #2381   +/-  ##
=======================================
  Coverage   93.33%   93.34%
=======================================
  Files          28       28
  Lines       26981    26997    +16
  Branches     1246     1247     +1
=======================================
+ Hits        25184    25201    +17
+ Misses       1763     1762     -1
  Partials       34       34
| Flag | Coverage Δ | |
|---|---|---|
| c-tests | 92.26% <95.00%> (-0.01%) | :arrow_down: |
| lwt-tests | 89.05% <ø> (ø) | |
| python-c-tests | 71.00% <85.71%> (+0.01%) | :arrow_up: |
| python-tests | 98.93% <100.00%> (+<0.01%) | :arrow_up: |
Flags with carried forward coverage won't be shown.
| Impacted Files | Coverage Δ | |
|---|---|---|
| python/tskit/trees.py | 98.66% <ø> (ø) | |
| c/tskit/tables.c | 90.25% <95.00%> (+<0.01%) | :arrow_up: |
| python/_tskitmodule.c | 90.70% <100.00%> (+0.03%) | :arrow_up: |
| python/tskit/tables.py | 98.96% <100.00%> (+<0.01%) | :arrow_up: |
If I've got it right, this is retaining all ancestry for any node that is a coalescent node anywhere in the sequence. This is slightly different to what extend_edges does, which is to output any input edges on which there is a coalescence somewhere. So, we could have a node that is a coalescent node on [0, 10], and then we output an edge for which it's a parent on [0, 15], but not a different edge for which it's a parent on [50, 60]. The argument for doing the second thing but not the first is that (a) it's not clear we have any information to infer the existence of these totally-unary edges; and (b) these totally-unary edges only increase the size of the tree sequence (as opposed to the partially-coalescent ones, which reduce it).
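The distinction can be made concrete on a toy example. The helper below is purely illustrative (kept_edges is not a tskit function): the per-node rule keeps every edge of a node that is coalescent anywhere, while the per-edge (extend_edges-style) rule keeps only edges that overlap some coalescent region, dropping the totally-unary [50, 60] edge from the example above:

```python
def kept_edges(edges, coalescent_intervals, policy):
    # edges: (left, right) intervals on which one node is a parent.
    # coalescent_intervals: regions on which that node is coalescent.
    def overlaps(a, b):
        return a[0] < b[1] and b[0] < a[1]

    if policy == "node":
        # Keep all of the node's edges if it coalesces anywhere.
        return list(edges) if coalescent_intervals else []
    elif policy == "edge":
        # Keep only edges that themselves contain a coalescence.
        return [e for e in edges
                if any(overlaps(e, c) for c in coalescent_intervals)]
    raise ValueError(f"unknown policy {policy!r}")

edges = [(0, 15), (50, 60)]
coal = [(0, 10)]
kept_edges(edges, coal, "node")  # both edges kept
kept_edges(edges, coal, "edge")  # only (0, 15) kept
```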
Thoughts?
Just to get this right in my head, you mean that if the node has non-contiguous edges (i.e. "trapped material") where it exists in the topology from (say) 10-15 and 50-60, but is not on any lineage, even as a unary node, from 15-50, then the current implementation keeps the 50-60 edge too?
If so, I think that's right, and the implementation here is not unreasonable behaviour. For example, you may want to keep the number of nodes unaffected, but preserve as much of the known ancestry as possible. Deleting these regions loses potentially useful ancestral information. It would be interesting to see whether nodes like this are inferred by tsinfer. I'll check now, but I suspect there are some cases where they will be. I can see the rationale for sometimes wanting to delete them too, though.
I suspect that you are right that they only increase the size of the tree sequence, but I'm not 100% convinced that this is always the case.
Following up on my comment above: we do have "split" edges like this in tsinfer (we also have quite a lot of nodes that are unary everywhere too). Here's an example; it may be worth putting something like this in the repo at https://github.com/petrelharp/num_edges
import collections
import numpy as np
import msprime
import tskit
import tsinfer

num_samples = 8
ts = msprime.sim_ancestry(
    num_samples,
    ploidy=1,
    sequence_length=10e8,
    recombination_rate=1e-8,
    record_full_arg=True,
    random_seed=222,
)
tsA = msprime.sim_mutations(ts, rate=1e-8, random_seed=123)
print(tsA.num_sites, "sites")
tsB = tsinfer.infer(tsinfer.SampleData.from_tree_sequence(tsA))
print(tsB.num_trees, "inferred trees")

def node_spans_max_children(ts):
    # node_id => [[left, right, max_n_children], ...]: one entry per
    # contiguous region over which the node is present in the topology,
    # recording the maximum number of children it has on that region
    node_spans = collections.defaultdict(list)
    curr_parents = collections.defaultdict(set)
    for tree, diffs in zip(ts.trees(), ts.edge_diffs()):
        for e_in in diffs.edges_in:
            u = e_in.parent
            if len(curr_parents[u]) == 0:
                # node starts
                node_spans[u].append(
                    [diffs.interval.left, diffs.interval.right, tree.num_children(u)]
                )
            else:
                node_spans[u][-1][2] = max(node_spans[u][-1][2], tree.num_children(u))
            curr_parents[e_in.parent].add(e_in.id)
        for e_out in diffs.edges_out:
            u = e_out.parent
            curr_parents[u].remove(e_out.id)
            if len(curr_parents[u]) == 0:
                # node ends
                node_spans[u][-1][1] = diffs.interval.right
    for u, data in node_spans.items():
        max_children = max(contiguous[2] for contiguous in data)
        for contiguous in data:
            if max_children > 1 and contiguous[2] < 2:
                print("Node", u, "is contiguously unary from", contiguous)

node_spans_max_children(tsB)
43 sites
16 inferred trees
Node 16 is contiguously unary from [548658128.0, 993319578.0, 1]
tsB.draw_svg(time_scale="rank", style=".n16 > .sym {fill: red} .mut {display: none}")
Ah, great, nice example. I do agree that sometimes it'd be conceptually nice to have those nodes in their entirety, but the two operations are in principle different, and we should figure out whether we want one or both of them. If the unconnected, unary-only segments of a coalescent node are also inferrable, then we should also include them! But I'm not sure that they are? Do you have a good reason that they might be?
So, one question is: are these unary-only spans real or artifactual? Like, when tsinfer infers a unary-only bit, how often is that unary-only bit "correct"? In the case you've got here, the extra unary bit does not seem to be correct, at least it isn't a parent to the same samples. On my machine (tsinfer 0.2.3) the node in question is 14, not 16, but:
t = tsA.first()
orig_node = t.mrca(1, 3, 7)
t = tsB.first()
new_node = t.mrca(1, 3, 7)
assert new_node == 14
left = 0
sA = set()
sB = set()
for interval, tA, tB in tsA.coiterate(tsB):
    new_sA = set(tA.samples(orig_node))
    new_sB = set(tB.samples(new_node))
    if sA != new_sA or sB != new_sB:
        print(f"{left} to {interval[0]} the node is a parent to:")
        print(f"    truth: {sA}")
        print(f"    inferred: {sB}")
    left = interval[0]
    sA = new_sA
    sB = new_sB
gives
0 to 0 the node is a parent to:
    truth: set()
    inferred: set()
147684860.0 to 160570091.0 the node is a parent to:
    truth: {1, 3, 7}
    inferred: {1, 3, 7}
169968287.0 to 173906518.0 the node is a parent to:
    truth: {3, 7}
    inferred: {1, 3, 7}
173906518.0 to 178484736.0 the node is a parent to:
    truth: {3, 7}
    inferred: {1, 3, 5, 6, 7}
178484736.0 to 204789241.0 the node is a parent to:
    truth: {3, 7}
    inferred: {3, 5, 7}
269704499.0 to 320784505.0 the node is a parent to:
    truth: {3, 7}
    inferred: set()
480931907.0 to 488755800.0 the node is a parent to:
    truth: {3, 5, 7}
    inferred: set()
540120992.0 to 548658128.0 the node is a parent to:
    truth: {0, 3, 5, 7}
    inferred: set()
564756120.0 to 576356796.0 the node is a parent to:
    truth: {0, 3, 5, 7}
    inferred: {3}
576356796.0 to 577972638.0 the node is a parent to:
    truth: {0, 5, 7}
    inferred: {3}
861036781.0 to 884623975.0 the node is a parent to:
    truth: {0, 5, 7}
    inferred: {3, 6}
959461190.0 to 983144527.0 the node is a parent to:
    truth: {0, 5, 7}
    inferred: set()
Have you got any situations where there's clearly information to infer those unary-only bits?
Moving this to https://github.com/petrelharp/num_edges/issues/2 so the discussion can continue easily even when the PR is closed.
I think what simplify is doing here is a reasonable operation, because it's not discarding any information. Given a full ARG, it keeps a full record of any topology for ancestors that are coalescent anywhere. This is a useful operation, and I think corresponds to what msprime may optionally output some day.
Implementing the exact mirror of edge_extend in simplify would be tricky, I think, and much more complicated than this, because you'd have to do a lot of reasoning about what's adjacent to what, and exclude edges depending on what they are adjacent to. I'm not enthusiastic about trying to do this - partly because it would be hard, but mostly because I don't see a lot of point (other than to provide the mirror-image of edge_extend). In practice the increase in size caused by extra unary-only edges will be negligible, I think, and not worth the substantially more complicated simplify algorithm (from, say, the forward simulation perspective).
I guess we could do some experiments easily enough to see how significant the difference is by counting the number of unary-only edges output by this for a range of input parameters?
I agree 100% with Jerome here. The operation as currently defined is simple and easy to implement and explain. The one involving adjacency, although theoretically interesting, is much more complicated as a simplification method IMO.
As an example of an additional complication due to adjacency: what happens when all the current edges vanish and are replaced with a single different edge with the same parent? This would appear to be adjacent, but I'm not sure whether it would be extended using @petrelharp's algorithm. The simplify method here can ignore all that.
Sounds good - let's do some experiments and see what the difference is. I do worry that the result will be harder to explain, though: to the question "why's that stuff in the tree sequence?" it's less confusing to say "it's only what you can learn from the trees" than if we add "plus a bit more stuff that made the algorithm easier".
I've always seen edge_extend as being a method to infer a subset of the true ancestry (i.e., what's returned by this method), which is why I said it would be exact in the SMC case and an approximation for things like full Hudson and gene conversion. I would then put it the other way around - we can infer a subset (~the vast majority?) of the true unary nodes using these properties you're exploiting.
I think it's a pretty strong assertion to say that the current edge_extend algorithm is the best possible and no extra information about unary nodes could ever be derived from the trees.
This LGTM @jeromekelleher: it's nice that it's relatively simple. I agree that a separate parameter is the way to go, in case people want to keep both unary-in-individuals and unary-when-coalescent (I don't think we test for combinations of the keep_unary_XXX params, do we?)
I assume that this would still remove a node which is unary in some places, and coalescent in others, but where the coalescent regions do not lead to more than one descendant sample (because one of the lineages down from a child is a piece of "hanging" topology that does not lead to a sample). Should we test this too?
The only other comment I would have is to think about the name of the parameter. To someone who hasn't thought about it a lot, keep unary if coalescent could possibly seem like a contradiction in terms. It's more like keep_unary_if_coalescent_elsewhere, but that's too long. We could have keep_partially_unary, although it's nice to have all the params starting with keep_unary_XXXXX, which I presume is why we hit upon the current naming. It's probably fine as it is, if documented well.
The only other comment I would have is to think about the name of the parameter.
Another possible name: keep_unary_regions (to emphasise that it's not that we are keeping unary nodes per se, but the unary regions of nodes that we would have kept anyway).
Here's a possible docstring. It's a bit contorted, but the best I could do after a fair bit of rewording.
If True, keep unary regions of retained nodes. After simplification, unary nodes may continue to exist in local trees, but only if those nodes would have been retained by the simplification process anyway (e.g. because they represent coalescences between two or more lineages from sample nodes to the root). If False, these nodes will still be present in the tree sequence, but are subject to removal from lineages where they are not coalescent points. Default: False
I think it's a pretty strong assertion to say that the current edge_extend algorithm is the best possible and no extra information about unary nodes could ever be derived from the trees.
Totally agree - I wasn't trying to say that it was!
How about
If True, keep all regions over which retained nodes are ancestral to samples. By default, nodes not in samples are only retained on regions of the genome where they are coalescent (i.e., an MRCA of at least two nodes in samples); if this option is True then nodes not in samples are still only retained if they are an MRCA of at least two nodes in samples, but all regions of the genome on which they are ancestral to a node in samples are also retained. Any extra retained information will show up as unary nodes on local trees. Default: False
OK - I did some quick experiments. This keep_unary argument seems to usually increase the number of edges by about as much as extend_edges decreases it. It's very interesting how much of this stuff there is!

So - this operation might still be useful, but I was wanting to include it in SLiM, to produce smaller, faster files. It looks like this operation won't do that. I'm not sure about the bigger picture at the moment.
Huh, isn't that interesting... I wonder if that's still true if you have a larger sample size? 6 is small enough that I can imagine odd things happening.
I thought I had some code and plots that showed that the hacked version generally reduced the number of edges. It could have been with a bigger sample size: I can't remember.
Hah, you're right about the number of samples:

... why would this happen? (i.e., why would more samples make the proportion of edges that are isolated unary edges on coalescent nodes smaller?)
... why would this happen? (i.e., why would more samples make the proportion of edges that are isolated unary edges on coalescent nodes smaller?)
Not obvious to me...
I wonder how this will change with a longer genome, though...
Ack, here are results for a longer sequence (0.1M instead of 0.01M):

Here's a bit more of an experiment changing both the number of samples and sequence length; recomb rate is 1e-8. Left is 10 samples; center is 100 samples; right is 1000.

Very interesting! I really didn't see this coming - the plot makes a clear practical case for edge_extend.
I guess we do need to implement it in simplify to be truly useful for forward sims. I think the present operation is useful also, but we'd need some new names I think.
Maybe we should move away from the phrase keep_unary entirely? I'm finding it less helpful every time I look at this stuff.