tskit
tskit copied to clipboard
Add flags_array and time_array to Tree class?
#1320 added arrays for the underlying tree vectors. I think it's probably a good idea to add the flags_array and time_array to the Tree class also, since these are things we'll often be using in low-level algorithms, and it's a faff (and currently copy-ful, semantics-wise) to do this via the tree sequence tables arrays.
On a technical node, I think we'd probably need to do this from the bottom up like the Tree arrays, using the Tree class as the base object for the numpy arrays. We could just create arrays in the high-level Tree class, but these would be copies and so technically incur a perf penalty (even if it would probably be negligible).
Bumping this to 0.3.7
Bumping this up the priority list as we need it for phylokit work.
A tricky issue here is how we deal with the fact that the dimensions of these arrays are different to the other tree arrays. If we use a direct pointer to the table memory, then we don't have any space for the time of the virtual root.
Hmm, I guess it wouldn't be a breaking change to always have the node table be one row larger than it actually is and that row have the correct null/unknown values? The row count wouldn't change, and we recently created space for that row with the change to the max rows. Haven't fully thought this through, but seems possible.
We could do this, but will be tricky ensuring that, e.g., the value of time is inf.
The two options would be:
- Get
tsk_node_table_expand_main_columnsto write the virtual root values to all the new space. I think this would be a perf and memory hit given what we learned recently about malloc being "lazy", at least on linux. - Have
tsk_node_table_add_row_internalwrite the virtual root values to the node row after the added one. This might not be too bad perf wise, worth measuring.tsk_node_table_append_columns,tsk_node_table_initandtsk_node_table_extendwould all have to do the same.
Seems quite tricky. Probably less fragile to just copy those arrays into permanent memory in the tree sequence object?
Yeah, probably the right choice. Could be created lazily to avoid the memory penalty to non-users?
Note there's less motivation for adding these arrays now that we have the node_flags etc array on the TreeSequence object (#2424). Probably not worth the bother of adding these here? Let's see how things look downstream first (e.g. in phylokit) before committing to the approach outlined here.
Closing this for now - can reopen if we really want them later.