tskit Add TreeSequence.max_time, min

trafficstars

There are many situations where it's handy to know what the max and min times are, and it would be nice if we could avoid having to do a full lookup of the np.max(ts.tables.nodes.time) to do it.

I think we can reasonably add these as attributes of the tsk_treeseq_t struct in C, filling them in during startup.

We can make them available to Python then easily, too.

May 13 '22 16:05 jeromekelleher

I've put this in on the before-paper milestone as it seems an obviously useful thing to do.

May 14 '22 21:05 benjeffery

Do we also want the minimum and maximum time only counting nodes in trees? That might not be the same as np.max(ts.tables.nodes.time) and np.min(ts.tables.nodes.time), if there are non-sample nodes that are not in the edge tables.

it's a slightly more complex calculation, so seems even more worthwhile caching.

May 26 '22 19:05 hyanwong

What would you call this? ts.max_time has a pretty clear interpretation. I guess ts.max_tree_node_time?

What are the applications?

May 26 '22 19:05 jeromekelleher

We have ts.max_root_time already, which I think is the maximum, right? so it's only the minimum that is missing. I can't remember if max_root_time is cached or not though.

A trivial application is to find the best scales for plotting. I'm sure there are lots of others though. My guess is that most of the time you want the min/max time of nodes in the trees, and want to omit the nodes that aren't in the topologies. In fact, most people will assume that max_time means max_root_time, won't they? And of course it will be as long as the ts is simplified.

May 26 '22 19:05 hyanwong

A couple of extra thoughts here before we make this concrete:

Should max_time include the max time of mutations too? If not, then should it actually be named max_node_time (and equivalently, min_node_time)?
do we count unreferenced non-sample nodes (i.e. isolated everywhere) as "in the tree sequence"? I can't remember if we actually state this anywhere. It could be argued that tree_sequence.max_time is the max time of any mutation or any node that is in the sequence of trees (i.e. only counting referenced or sampled nodes, and mutations above them)

May 26 '22 22:05 hyanwong

My two cents (mostly, for not complicating things):

Should max_time include the max time of mutations too?

Sure, why not.

do we count unreferenced non-sample nodes (i.e. isolated everywhere) as "in the tree sequence"?

Yes, and they should not be missed out (unless you have a use case where such nodes are in the tree-sequence-data-structure and have very weird times?).

May 26 '22 23:05 petrelharp

I think we should keep ts.max_time simple - it is the maximum of the times of the entities in the data model (and conversely for min_time) - np.max([np.max(ts.tables.nodes.time), np.max(ts.tables.mutations.time)])

If there are other things we want, let's open a separate issue for them so that this issue can be closed by a single PR.

May 27 '22 08:05 jeromekelleher

Great, if it includes the mutation times too, then I can see max_time and min_time being reasonable names. I think it should be documented about the possibility of them not being in the sequence of trees, though.

I assume that the time of the virtual root, since it is not actually stored in a table (I think?) is not considered to be part of the data model.

May 27 '22 11:05 hyanwong

I assume that the time of the virtual root, since it is not actually stored in a table (I think?) is not considered to be part of the data model.

That's right - the time of the virtual root is not stored, and defined as positive infinity.

May 30 '22 10:05 jeromekelleher

tskit
tskit copied to clipboard

Add TreeSequence.max_time, min_time attributes

tskit tskit copied to clipboard

Add TreeSequence.max_time, min_time attributes

tskit
tskit copied to clipboard