tsinfer
tsinfer copied to clipboard
Removing entirely unary nodes
At the moment, in tsinfer, we don’t remove unary regions of nodes even if the nodes are unary everywhere. I.e. we have nodes which are coalescent in places and unary in other places, but also some entirely unary nodes. I think these entirely unary ones are untouched by @petrelharp’s extend_edges routine. In a quick test, if we remove these never-coalescent nodes, we reduce both the number of edges and the number of nodes. Should we remove them by default? (probably yes)
An argument for keeping them is that they could “bring together” lineages at different points in the genome into a common ancestor, which could then share a single path back to more distance ancestors (i.e. these could be CA-non-coalescent nodes in Hudson ARG terminology). I’m not sure how useful that is, but there is an argument, I suppose, for adding this as an option to tsinfer.post_process(), e.g. keep_unary_regions (default="for coalescent_nodes").
We can see how many there are using the code in https://github.com/tskit-dev/tskit/discussions/2774. E.g. for an inferred TS (its) of 2000 samples over 10Mb, we have:
import tskit
def remove_entirely_unary(ts):
def node_is_coalescent(ts):
is_coalescent = np.zeros(ts.num_nodes, dtype=bool)
for tree in ts.trees():
is_coalescent[tree.num_children_array[:-1] > 1] = True
return np.where(is_coalescent)[0]
# Only keep unary nodes where coalescent (hack using keep_unary_in_individuals)
tables = ts.dump_tables()
nodes_individual = tables.nodes.individual
for u in node_is_coalescent(ts):
nodes_individual[u] = tables.individuals.add_row()
tables.nodes.individual = nodes_individual
tables.simplify(keep_unary_in_individuals=True, filter_sites=False)
# remove the individuals we added
nodes_individual = tables.nodes.individual
nodes_individual[nodes_individual >= ts.num_individuals] = tskit.NULL
tables.nodes.individual = nodes_individual
tables.individuals.truncate(ts.num_individuals)
return tables.tree_sequence()
its = tskit.load("inferred.trees")
semi_unary_ts = remove_entirely_unary(its)
print("num nodes", its.num_nodes, semi_unary_ts.num_nodes, its.simplify().num_nodes)
print("num edges", its.num_edges, semi_unary_ts.num_edges, its.simplify().num_edges)
The middle number is the one where entirely unary nodes have been removed.
num nodes 25922 22184 22184
num edges 75391 70940 104599
What percentage of nodes/edges would we remove from an inferred ts from real data?
Good question. I will test on one of the unified genealogy tree sequences (before simplification: these are the files that were recently moved in rescomp)
Interestingly, removing entirely-unary nodes can sometimes make dating worse, in my tests of the new variational_gamma method. So maybe we should leave this for users to do if they want to compress the tree sequence a bit more.