tskit
tskit copied to clipboard
Add order argument to TreeSequence.nodes
I was just after getting the oldest node(s) in the tree sequence, which I did by node_times = ts.tables.nodes.time; oldest = np.flatnonzero(node_times == np.max(node_times))
but beginners might not know how to do this sort of thing, which involves knowing about tables. Is it worth having a few wrapper functions for things like this (although I don't know where they would belong)?
Same for youngest nodes too, I guess. I don't know if returning an iterator over node objects would be nicer, e.g.
ts.nodes(restrict="oldest")
ts.nodes(restrict="youngest")
ts.nodes(restrict=array_of_node_ids)
or something like that.
There's a trade-off here between having useful methods and having a gigantic unwieldy API. Do you think this is a very common operation? My gut instinct here is that this is simple enough that it is a user operation.
Yes, I see that this is a bit of an edge case. One issue is that people might do
np.argmax(ts.tables.nodes.time)
which would only return the first of the oldest nodes (which is what I did before realising my mistake). The recommended way is (I guess, as above) a two-liner, which people are less likely to use:
node_times = ts.tables.nodes.time
np.flatnonzero(node_times == np.max(node_times))
Having that automatically available somehow would stop people making that mistake.
Implementing it by restricting the node iterator, as I suggested above, would mean not adding another method, at the cost of increasing the calling parameters of ts.nodes()
. I don't know what the best trade off is here, or indeed if (as you say), it's too much code bloat.
Another point is that mostly people expect the node table to have the nodes in time order (because that's what is output by msprime) but this is not guaranteed. So having a documented method to iterate over nodes in time order would be a way to ensure that this was enforced in a routine. Something like the following doesn't seem too bad an idea (although the exact semantics could probably be improved):
ts.nodes(time_asc=True) # "None" for arbitrary (ts) order, False for descending
Then the oldest node is next(ts.nodes(time_asc=False))
and the youngest is next(ts.nodes(time_asc=True))
. Or something like that anyway.
I have seen a fair bit of code that assumes that the samples are the first N nodes, and that the oldest node is the last one, so it might be useful to have a method that can be used in the examples without having to jump through numpy hoops.
Adding an order
argument similar to the Tree.nodes() method would solve the majority of this (ts.nodes(order="timedesc")
) I think. The order
defaults to increasing node ID.
Getting the k nodes that are exactly as old as each other is very niche I think and not worth a library function.
Adding an
order
argument similar to the Tree.nodes() method would solve the majority of this (ts.nodes(order="timedesc")
) I think. Theorder
defaults to increasing node ID.
Yes, I think this is nice and general. 👍
Changed issue title.
def nodes(self, *, order=None):
order = "id" if order is None else order
if order not in ["id", "timedesc", "timeasc"]:
raise ValueError(...)
# do stuff
In #2470 I noted that we want to break ties by using the node ID (i.e. np.lexsort((np.arange(ts.tables.nodes.num_rows), ts.tables.nodes.time))
). In particular, we should guarantee that order="timeasc"
gives the same order of nodes as used in edges.parent
(although some nodes, such as leaf nodes will be missing from edges.parent). This would then provide consistency for ARG traversal methods (see https://github.com/tskit-dev/tskit/discussions/2469)