tskit
tskit copied to clipboard
Add TreeSequence.max_time, min_time attributes
There are many situations where it's handy to know what the max and min times are, and it would be nice if we could avoid having to do a full lookup of the np.max(ts.tables.nodes.time) to do it.
I think we can reasonably add these as attributes of the tsk_treeseq_t struct in C, filling them in during startup.
We can make them available to Python then easily, too.
I've put this in on the before-paper milestone as it seems an obviously useful thing to do.
Do we also want the minimum and maximum time only counting nodes in trees? That might not be the same as np.max(ts.tables.nodes.time) and np.min(ts.tables.nodes.time), if there are non-sample nodes that are not in the edge tables.
it's a slightly more complex calculation, so seems even more worthwhile caching.
What would you call this? ts.max_time has a pretty clear interpretation. I guess ts.max_tree_node_time?
What are the applications?
We have ts.max_root_time already, which I think is the maximum, right? so it's only the minimum that is missing. I can't remember if max_root_time is cached or not though.
A trivial application is to find the best scales for plotting. I'm sure there are lots of others though. My guess is that most of the time you want the min/max time of nodes in the trees, and want to omit the nodes that aren't in the topologies. In fact, most people will assume that max_time means max_root_time, won't they? And of course it will be as long as the ts is simplified.
A couple of extra thoughts here before we make this concrete:
- Should
max_timeinclude the max time of mutations too? If not, then should it actually be namedmax_node_time(and equivalently,min_node_time)? - do we count unreferenced non-sample nodes (i.e. isolated everywhere) as "in the tree sequence"? I can't remember if we actually state this anywhere. It could be argued that
tree_sequence.max_timeis the max time of any mutation or any node that is in the sequence of trees (i.e. only counting referenced or sampled nodes, and mutations above them)
My two cents (mostly, for not complicating things):
- Should max_time include the max time of mutations too?
Sure, why not.
- do we count unreferenced non-sample nodes (i.e. isolated everywhere) as "in the tree sequence"?
Yes, and they should not be missed out (unless you have a use case where such nodes are in the tree-sequence-data-structure and have very weird times?).
I think we should keep ts.max_time simple - it is the maximum of the times of the entities in the data model (and conversely for min_time) - np.max([np.max(ts.tables.nodes.time), np.max(ts.tables.mutations.time)])
If there are other things we want, let's open a separate issue for them so that this issue can be closed by a single PR.
Great, if it includes the mutation times too, then I can see max_time and min_time being reasonable names. I think it should be documented about the possibility of them not being in the sequence of trees, though.
I assume that the time of the virtual root, since it is not actually stored in a table (I think?) is not considered to be part of the data model.
I assume that the time of the virtual root, since it is not actually stored in a table (I think?) is not considered to be part of the data model.
That's right - the time of the virtual root is not stored, and defined as positive infinity.