tskit icon indicating copy to clipboard operation
tskit copied to clipboard

Methods for reordering nodes and individuals

Open hyanwong opened this issue 4 years ago • 13 comments

I would like to be able to reorder the individuals so that the IDs change but the structure is all the same. I would similarly like to be able to do this for nodes. I don't think there are built-in functions to do this yet, although TreeSequence.subset will do it as a side effect for nodes, if given an ordering that lists all the node IDs.

hyanwong avatar Dec 11 '21 14:12 hyanwong

Ok - so, what exactly do you need to do? subset( ) will let you put the nodes in arbitrary order, and then orders individuals by first occurrence of node, IIRC.

petrelharp avatar Dec 11 '21 14:12 petrelharp

So I'm assuming this would be a method on TableCollection that would take an array for each table that specifies the new order (or None for no change)? Sounds do-able. I'd like to understand a little more about the context you're doing this in though?

benjeffery avatar Dec 16 '21 12:12 benjeffery

Actually it turns out that I don't need it in this instance. But there have been times in the past when I have wanted to do this.

Here, I wanted it to be able to order all individuals at a given time (in generations) so that when visualising their pedigree, the lines connecting them had crossings minimised. I had the order within each generation, but it was fiddly re-sorting the indexes in the right way, because I would have to remap the individuals.parents fields, as well as the nodes.individual field. I could imagine easily forgetting which references I needed to change.

Previously I've wanted, for example, to put all the sample nodes as the first n nodes, even when some are historical samples (and so there might be younger internal nodes). But that's just off the top of my head.

hyanwong avatar Dec 16 '21 12:12 hyanwong

Hi @benjeffery - this seems like it would be useful in some tsinfer improvements that @szhan might make (see https://github.com/tskit-dev/tsinfer/issues/189#issuecomment-1032042549 for context). I think this might be a simple enough addition to tskit for him to try making a PR about it. Does that seem sensible? Are there any misgivings about adding the functionality to tskit?

hyanwong avatar Feb 10 '22 14:02 hyanwong

I guess at a minimum we would want TableCollection.reorder_tables(nodes=X, individuals=Y, populations=Z), since those are the only tables which are allowed to have an arbitrary order. I'm not sure about migrations?

The ability to reorder the other tables edges, sites, etc to form a non-valid table collection is a bit esoteric, so not sure we want to bother with that.

hyanwong avatar Feb 10 '22 15:02 hyanwong

How will this interact with subset? For instance, if nodes=X would X be required to have all of the nodes? One way forward would be to add individual_order=Y and population_order=Z arguments to subset( ) (and then possibly rename it reorder( ) since maybe that's a better name)?

The ability to reorder the other tables edges, sites, etc to form a non-valid table collection is a bit esoteric, so not sure we want to bother with that.

Definitely.

petrelharp avatar Feb 10 '22 16:02 petrelharp

Are there any misgivings about adding the functionality to tskit?

I don't think so - is this a perf sensitive application? We normally add these things in at the python layer, at least initially and only push down to C if needed. @szhan should go ahead and PR something for sure, especially if it isn't much work to get to a prototype that we can concretely discuss.

benjeffery avatar Feb 11 '22 10:02 benjeffery

No, I don't think this is perf-sensitive. It might be done for a large tree sequence, but it is only done once in the entire pipeline, so it will be a minuscule part of the whole. I think implementing in python is the way forward.

However, I guess we want to decide on whether (as @petrelharp says) we want to roll this into the "subset" method?

hyanwong avatar Feb 11 '22 10:02 hyanwong

However, I guess we want to decide on whether (as @petrelharp says) we want to roll this into the "subset" method?

I'd rather have a separate method so the two uses are clear, even if there is some small overlap in functionality.

benjeffery avatar Feb 11 '22 10:02 benjeffery

OK, If @petrelharp agrees, I think the method / API is well defined enough for @szhan to have a go at it:

TableCollection.reorder(nodes=None, individuals=None, populations=None)

Acts in place, and could potentially return a tuple of 3 numpy arrays mapping old->new indexes? That could be a namedtuple, I guess, if we want to be sophisticated:

ReorderMap = collections.namedtuple("ReorderMap", "nodes, individuals, populations")
class TableCollection:
    ...
    def reorder(nodes=None, individuals=None, populations=None):
        node_map=None
        individuals_map=None
        populations_map=None
        ...
        return ReorderMap(nodes=node_map, individuals= individuals_map, populations=populations_map)

hyanwong avatar Feb 11 '22 10:02 hyanwong

Need not even have tuple-like behaviour, could just be a simple dataclass. Possibly better, since the tuple-like semantics are often unhelpful.

jeromekelleher avatar Feb 11 '22 11:02 jeromekelleher

Soiunds good to me! However, it'd be nice to have a "see also" note in the docstring for both of them.

petrelharp avatar Feb 16 '22 05:02 petrelharp

Here's some demo code showing the logic, @szhan. ~~I think I have it right, but you should check that the mapping is the right way around.~~ Now edited - you do need to create an inverted map

import numpy as np
import msprime
ts = msprime.sim_ancestry(10, sequence_length=100, recombination_rate=0.1, random_seed=1)
ts = msprime.sim_mutations(ts, rate=0.1, random_seed=1)

def reorder_nodes(tables, node_map):
    assert len(np.unique(node_map)) == len(node_map) == tables.nodes.num_rows
    assert np.min(node_map) == 0
    assert np.max(node_map) == tables.nodes.num_rows -1
    
    inverted_map = np.zeros(tables.nodes.num_rows, dtype=tables.edges.parent.dtype)
    inverted_map[node_map] = np.arange(tables.nodes.num_rows)

    tables.nodes.replace_with(tables.nodes[node_map])
    tables.edges.parent = inverted_map[tables.edges.parent]
    tables.edges.child = inverted_map[tables.edges.child]
    tables.mutations.node = inverted_map[tables.mutations.node]
    tables.sort()


random_node_order = np.arange(ts.num_nodes)
np.random.seed(123)  # make this example replicable
np.random.shuffle(random_node_order)
table_collection = ts.dump_tables()
reorder_nodes(table_collection, random_node_order)
new_ts = table_collection.tree_sequence()
# check the diversity is the same
print(ts.diversity(), new_ts.diversity())

# more intensive: check that the (unlabelled) topologies are the same
for interval, tree1, tree2 in ts.coiterate(new_ts):
    assert tree1.rank()[0] == tree2.rank()[0]  # could use rank().shape once https://github.com/tskit-dev/tskit/pull/2392 is merged

hyanwong avatar Jul 07 '22 16:07 hyanwong