tsinfer icon indicating copy to clipboard operation
tsinfer copied to clipboard

Removing entirely unary nodes

Open hyanwong opened this issue 2 years ago • 3 comments

At the moment, in tsinfer, we don’t remove unary regions of nodes even if the nodes are unary everywhere. I.e. we have nodes which are coalescent in places and unary in other places, but also some entirely unary nodes. I think these entirely unary ones are untouched by @petrelharp’s extend_edges routine. In a quick test, if we remove these never-coalescent nodes, we reduce both the number of edges and the number of nodes. Should we remove them by default? (probably yes)

An argument for keeping them is that they could “bring together” lineages at different points in the genome into a common ancestor, which could then share a single path back to more distance ancestors (i.e. these could be CA-non-coalescent nodes in Hudson ARG terminology). I’m not sure how useful that is, but there is an argument, I suppose, for adding this as an option to tsinfer.post_process(), e.g. keep_unary_regions (default="for coalescent_nodes").

We can see how many there are using the code in https://github.com/tskit-dev/tskit/discussions/2774. E.g. for an inferred TS (its) of 2000 samples over 10Mb, we have:

import tskit
def remove_entirely_unary(ts):
    def node_is_coalescent(ts):
        is_coalescent = np.zeros(ts.num_nodes, dtype=bool)
        for tree in ts.trees():
            is_coalescent[tree.num_children_array[:-1] > 1] = True
        return np.where(is_coalescent)[0]
    
    # Only keep unary nodes where coalescent (hack using keep_unary_in_individuals)
    tables = ts.dump_tables()
    nodes_individual = tables.nodes.individual
    for u in node_is_coalescent(ts):
        nodes_individual[u] = tables.individuals.add_row()
    tables.nodes.individual = nodes_individual
    tables.simplify(keep_unary_in_individuals=True, filter_sites=False)
    # remove the individuals we added
    nodes_individual = tables.nodes.individual
    nodes_individual[nodes_individual >= ts.num_individuals] = tskit.NULL
    tables.nodes.individual = nodes_individual
    tables.individuals.truncate(ts.num_individuals)
    return tables.tree_sequence()

its = tskit.load("inferred.trees")
semi_unary_ts = remove_entirely_unary(its)
print("num nodes", its.num_nodes, semi_unary_ts.num_nodes, its.simplify().num_nodes)
print("num edges", its.num_edges, semi_unary_ts.num_edges, its.simplify().num_edges)

The middle number is the one where entirely unary nodes have been removed.

num nodes 25922 22184 22184
num edges 75391 70940 104599

hyanwong avatar Jul 11 '23 14:07 hyanwong

What percentage of nodes/edges would we remove from an inferred ts from real data?

jeromekelleher avatar Jul 13 '23 08:07 jeromekelleher

Good question. I will test on one of the unified genealogy tree sequences (before simplification: these are the files that were recently moved in rescomp)

hyanwong avatar Jul 13 '23 09:07 hyanwong

Interestingly, removing entirely-unary nodes can sometimes make dating worse, in my tests of the new variational_gamma method. So maybe we should leave this for users to do if they want to compress the tree sequence a bit more.

hyanwong avatar Jan 14 '24 11:01 hyanwong