tskit icon indicating copy to clipboard operation
tskit copied to clipboard

Add flags_array and time_array to Tree class?

Open jeromekelleher opened this issue 4 years ago • 9 comments

#1320 added arrays for the underlying tree vectors. I think it's probably a good idea to add the flags_array and time_array to the Tree class also, since these are things we'll often be using in low-level algorithms, and it's a faff (and currently copy-ful, semantics-wise) to do this via the tree sequence tables arrays.

On a technical node, I think we'd probably need to do this from the bottom up like the Tree arrays, using the Tree class as the base object for the numpy arrays. We could just create arrays in the high-level Tree class, but these would be copies and so technically incur a perf penalty (even if it would probably be negligible).

jeromekelleher avatar Apr 20 '21 18:04 jeromekelleher

Bumping this to 0.3.7

benjeffery avatar May 13 '21 11:05 benjeffery

Bumping this up the priority list as we need it for phylokit work.

jeromekelleher avatar Jun 28 '22 11:06 jeromekelleher

A tricky issue here is how we deal with the fact that the dimensions of these arrays are different to the other tree arrays. If we use a direct pointer to the table memory, then we don't have any space for the time of the virtual root.

jeromekelleher avatar Jul 04 '22 09:07 jeromekelleher

Hmm, I guess it wouldn't be a breaking change to always have the node table be one row larger than it actually is and that row have the correct null/unknown values? The row count wouldn't change, and we recently created space for that row with the change to the max rows. Haven't fully thought this through, but seems possible.

benjeffery avatar Jul 04 '22 09:07 benjeffery

We could do this, but will be tricky ensuring that, e.g., the value of time is inf.

jeromekelleher avatar Jul 04 '22 11:07 jeromekelleher

The two options would be:

  • Get tsk_node_table_expand_main_columns to write the virtual root values to all the new space. I think this would be a perf and memory hit given what we learned recently about malloc being "lazy", at least on linux.
  • Have tsk_node_table_add_row_internal write the virtual root values to the node row after the added one. This might not be too bad perf wise, worth measuring. tsk_node_table_append_columns, tsk_node_table_init and tsk_node_table_extend would all have to do the same.

benjeffery avatar Jul 05 '22 10:07 benjeffery

Seems quite tricky. Probably less fragile to just copy those arrays into permanent memory in the tree sequence object?

jeromekelleher avatar Jul 05 '22 10:07 jeromekelleher

Yeah, probably the right choice. Could be created lazily to avoid the memory penalty to non-users?

benjeffery avatar Jul 06 '22 10:07 benjeffery

Note there's less motivation for adding these arrays now that we have the node_flags etc array on the TreeSequence object (#2424). Probably not worth the bother of adding these here? Let's see how things look downstream first (e.g. in phylokit) before committing to the approach outlined here.

jeromekelleher avatar Jul 21 '22 14:07 jeromekelleher

Closing this for now - can reopen if we really want them later.

jeromekelleher avatar Oct 05 '22 12:10 jeromekelleher