tskit icon indicating copy to clipboard operation
tskit copied to clipboard

Add TreeSequence.max_time, min_time attributes

Open jeromekelleher opened this issue 3 years ago • 9 comments
trafficstars

There are many situations where it's handy to know what the max and min times are, and it would be nice if we could avoid having to do a full lookup of the np.max(ts.tables.nodes.time) to do it.

I think we can reasonably add these as attributes of the tsk_treeseq_t struct in C, filling them in during startup.

We can make them available to Python then easily, too.

jeromekelleher avatar May 13 '22 16:05 jeromekelleher

I've put this in on the before-paper milestone as it seems an obviously useful thing to do.

benjeffery avatar May 14 '22 21:05 benjeffery

Do we also want the minimum and maximum time only counting nodes in trees? That might not be the same as np.max(ts.tables.nodes.time) and np.min(ts.tables.nodes.time), if there are non-sample nodes that are not in the edge tables.

it's a slightly more complex calculation, so seems even more worthwhile caching.

hyanwong avatar May 26 '22 19:05 hyanwong

What would you call this? ts.max_time has a pretty clear interpretation. I guess ts.max_tree_node_time?

What are the applications?

jeromekelleher avatar May 26 '22 19:05 jeromekelleher

We have ts.max_root_time already, which I think is the maximum, right? so it's only the minimum that is missing. I can't remember if max_root_time is cached or not though.

A trivial application is to find the best scales for plotting. I'm sure there are lots of others though. My guess is that most of the time you want the min/max time of nodes in the trees, and want to omit the nodes that aren't in the topologies. In fact, most people will assume that max_time means max_root_time, won't they? And of course it will be as long as the ts is simplified.

hyanwong avatar May 26 '22 19:05 hyanwong

A couple of extra thoughts here before we make this concrete:

  1. Should max_time include the max time of mutations too? If not, then should it actually be named max_node_time (and equivalently, min_node_time)?
  2. do we count unreferenced non-sample nodes (i.e. isolated everywhere) as "in the tree sequence"? I can't remember if we actually state this anywhere. It could be argued that tree_sequence.max_time is the max time of any mutation or any node that is in the sequence of trees (i.e. only counting referenced or sampled nodes, and mutations above them)

hyanwong avatar May 26 '22 22:05 hyanwong

My two cents (mostly, for not complicating things):

  1. Should max_time include the max time of mutations too?

Sure, why not.

  1. do we count unreferenced non-sample nodes (i.e. isolated everywhere) as "in the tree sequence"?

Yes, and they should not be missed out (unless you have a use case where such nodes are in the tree-sequence-data-structure and have very weird times?).

petrelharp avatar May 26 '22 23:05 petrelharp

I think we should keep ts.max_time simple - it is the maximum of the times of the entities in the data model (and conversely for min_time) - np.max([np.max(ts.tables.nodes.time), np.max(ts.tables.mutations.time)])

If there are other things we want, let's open a separate issue for them so that this issue can be closed by a single PR.

jeromekelleher avatar May 27 '22 08:05 jeromekelleher

Great, if it includes the mutation times too, then I can see max_time and min_time being reasonable names. I think it should be documented about the possibility of them not being in the sequence of trees, though.

I assume that the time of the virtual root, since it is not actually stored in a table (I think?) is not considered to be part of the data model.

hyanwong avatar May 27 '22 11:05 hyanwong

I assume that the time of the virtual root, since it is not actually stored in a table (I think?) is not considered to be part of the data model.

That's right - the time of the virtual root is not stored, and defined as positive infinity.

jeromekelleher avatar May 30 '22 10:05 jeromekelleher