tskit icon indicating copy to clipboard operation
tskit copied to clipboard

Add order argument to TreeSequence.nodes

Open hyanwong opened this issue 2 years ago • 7 comments

I was just after getting the oldest node(s) in the tree sequence, which I did by node_times = ts.tables.nodes.time; oldest = np.flatnonzero(node_times == np.max(node_times)) but beginners might not know how to do this sort of thing, which involves knowing about tables. Is it worth having a few wrapper functions for things like this (although I don't know where they would belong)?

hyanwong avatar Jun 29 '22 08:06 hyanwong

Same for youngest nodes too, I guess. I don't know if returning an iterator over node objects would be nicer, e.g.

ts.nodes(restrict="oldest")
ts.nodes(restrict="youngest")
ts.nodes(restrict=array_of_node_ids)

or something like that.

hyanwong avatar Jun 29 '22 08:06 hyanwong

There's a trade-off here between having useful methods and having a gigantic unwieldy API. Do you think this is a very common operation? My gut instinct here is that this is simple enough that it is a user operation.

benjeffery avatar Jun 29 '22 16:06 benjeffery

Yes, I see that this is a bit of an edge case. One issue is that people might do

np.argmax(ts.tables.nodes.time)

which would only return the first of the oldest nodes (which is what I did before realising my mistake). The recommended way is (I guess, as above) a two-liner, which people are less likely to use:

node_times = ts.tables.nodes.time
np.flatnonzero(node_times == np.max(node_times))

Having that automatically available somehow would stop people making that mistake.

Implementing it by restricting the node iterator, as I suggested above, would mean not adding another method, at the cost of increasing the calling parameters of ts.nodes(). I don't know what the best trade off is here, or indeed if (as you say), it's too much code bloat.

hyanwong avatar Jun 30 '22 07:06 hyanwong

Another point is that mostly people expect the node table to have the nodes in time order (because that's what is output by msprime) but this is not guaranteed. So having a documented method to iterate over nodes in time order would be a way to ensure that this was enforced in a routine. Something like the following doesn't seem too bad an idea (although the exact semantics could probably be improved):

ts.nodes(time_asc=True)  # "None" for arbitrary (ts) order, False for descending

Then the oldest node is next(ts.nodes(time_asc=False)) and the youngest is next(ts.nodes(time_asc=True)). Or something like that anyway.

I have seen a fair bit of code that assumes that the samples are the first N nodes, and that the oldest node is the last one, so it might be useful to have a method that can be used in the examples without having to jump through numpy hoops.

hyanwong avatar Jun 30 '22 07:06 hyanwong

Adding an order argument similar to the Tree.nodes() method would solve the majority of this (ts.nodes(order="timedesc")) I think. The order defaults to increasing node ID.

Getting the k nodes that are exactly as old as each other is very niche I think and not worth a library function.

jeromekelleher avatar Jun 30 '22 08:06 jeromekelleher

Adding an order argument similar to the Tree.nodes() method would solve the majority of this (ts.nodes(order="timedesc")) I think. The order defaults to increasing node ID.

Yes, I think this is nice and general. 👍

hyanwong avatar Jun 30 '22 08:06 hyanwong

Changed issue title.

def nodes(self, *, order=None):
     order = "id" if order is None else order
     if order not in ["id", "timedesc", "timeasc"]:
          raise ValueError(...)
     # do stuff

jeromekelleher avatar Jun 30 '22 08:06 jeromekelleher

In #2470 I noted that we want to break ties by using the node ID (i.e. np.lexsort((np.arange(ts.tables.nodes.num_rows), ts.tables.nodes.time))). In particular, we should guarantee that order="timeasc" gives the same order of nodes as used in edges.parent (although some nodes, such as leaf nodes will be missing from edges.parent). This would then provide consistency for ARG traversal methods (see https://github.com/tskit-dev/tskit/discussions/2469)

hyanwong avatar Aug 23 '22 14:08 hyanwong