tskit
tskit copied to clipboard
Average arity of nodes in the trees
It's useful to have a summary of the average arity of nodes in the trees. Here's a function that computes it:
def nodes_mean_arity(ts):
weighted_arity = np.zeros(ts.num_nodes)
span_in_tree = np.zeros(ts.num_nodes)
for tree in ts.trees():
weighted_arity += tree.num_children_array[:-1] * tree.span
span_in_tree[tree.num_children_array[:-1] > 0] += tree.span
span_in_tree[ts.samples()] = ts.sequence_length
weighted_arity /= span_in_tree
return weighted_arity
We could add this as a function to the TreeSequence? (The naming is chosen to follow the arrays like nodes_time etc)
We can imagine having different options for weighting, at some point.
To do it really efficiently, we'd need to have to implement in C with an edge-diffs approach, but this implementation is probably quite fast for most purposes.
Although, I wonder if there a way of expressing the operation as a node stat?
There's a more efficient implementation by simply adding up all the edge spans per parent node, and dividing by the span for which the parent node is in the tree:
span_sums = np.zeros(ts.num_nodes)
# Find the edge indices where parents change
i = np.insert(np.nonzero(np.diff(ts.edges_parent))[0] + 1, 0, 0)
span_sums[ts.edges_parent[i]] = np.add.reduceat(ts.edges_right - ts.edges_left, i)
# Now divide by the node spans
This is insanely quick, but will be slowed down by the node span calculations, which need to be done using edge diffs. Here's the node span implementation that I thought might be generally useful (see https://github.com/tskit-dev/tskit/discussions/2718):
def node_spans_parent(ts, min_num_children=None):
"""
Returns the spans over which each node is a parent of greater or equal to
min_num_children children.
Note - to find the "general node spans", i.e. the spans over which nodes are present in all local
trees, you may wish to include cases where nodes are children as well as parents. If all leaf
nodes in your trees are sample nodes, as is usually the case, the general node spans can simply
be found by setting sample node spans to the total sequence length (as by definition, samples
are present in all trees in the tree sequence). This can be done as follows
node_span = node_spans_parent(ts)
node_span[ts.samples()] = ts.sequence_length
Note - if you wish to ignore regions over which a node is a unary node, you can set
min_num_children = 2
Note - this also counts "dead" branches
"""
if min_num_children is None:
min_num_children = 1
num_children = np.zeros(ts.num_nodes, dtype=np.int32)
span_start = np.zeros(ts.num_nodes)
node_span = np.zeros(ts.num_nodes)
for interval, edges_out, edges_in in ts.edge_diffs(include_terminal=True):
touched=set()
for edge in edges_out:
num_children[edge.parent] -= 1
if num_children[edge.parent] == min_num_children - 1:
# Should no longer count this node
node_span[edge.parent] += interval.left - span_start[edge.parent]
for edge in edges_in:
num_children[edge.parent] += 1
if num_children[edge.parent] == min_num_children:
span_start[edge.parent] = interval.left
return node_span
I believe you can also compute node spans as
node_spans = ts.sample_count_stat([ts.samples()], lambda x: (x > 0), 1, polarised=True, span_normalise=False, strict=False, mode='node')
Edit: I got polarised wrong the first time - should be True.
This solution is very nice, @hyanwong. I think you can get span_sums as follows, somewhat more simply:
span_sums = np.bincount(ts.edges_parent, weights=ts.edges_right - ts.edges_left, minlength=ts.num_nodes)
So, that'd be
def nodes_mean_arity2(ts):
span_sums = np.bincount(ts.edges_parent, weights=ts.edges_right - ts.edges_left, minlength=ts.num_nodes)
node_spans = ts.sample_count_stat([ts.samples()], lambda x: (x > 0), 1, polarised=True, span_normalise=False, strict=False, mode='node')[:,0]
return span_sums / node_spans
... which seems to agree with Jerome's implementation!
I believe you can also compute node spans ...
Oh, brilliant. That's a really neat thing to know. It would be fantastic to either document this node span one-liner somewhere, or add a simple wrapper function to return it?
node_spans = ts.sample_count_stat...
However, in a basic test, running @petrelharp's sample_count_stat approach on the large Covid tree seq (1.4M nodes, 1496 trees) took 10 mins on my laptop, whereas running the node_spans_parent function that I showed above took 21 seconds. So it might be worth having something like this in tskit anyway? I'm not sure if anything could be sped up e.g. using numba, but I think not, and also suspect that the sample_count_stat might already be quite optimised?
However, the bincount approach is about twice as fast on the covid ARG as my np.add.reduceat approach, although it's trivial in terms of the total time for the calc (6 milliseconds vs 12 milliseconds, on average!)
What would be nice is the ability to slice the arity stat to average over genome region and/or timeslices, which would put it more into the stats framework, right?
Super neat!
That won't be very efficient as it's calling back to python, but we could easily implement the node spans in C using this approach
took 10 mins on my laptop, whereas running the node_spans_parent function that I showed above took 21 seconds.
Huh, that's confusing! I suppose it might possibly be due to the C code calling back out to the python summary function?!? Otherwise I don't see why that code should take a long time; we've not got high-dimesional input or output...
I do think this should be a method - there's other use cases for it (maybe there's even an issue for this already...)
Huh, that's confusing!
Perhaps you could check on your machine, with a big msprime-generated tree sequence, Peter?
Confusing! I suppose it might possibly be due to the C code calling back out to the python summary function?!? Otherwise I don't see why that code should take a long time; we've not got high-dimesional input or output...
This is definitely because the summary func is calling back to Python - I wouldn't expect this to be efficient, or even indicative of real performance. We would need to implement in C - but it's easy.
I agree we should make this stuff available as API methods, it's just a question of how to break things up and package.
I'm happy to do that; I think the main question is what to call the function; weigh in over at #2725.
Closing for inactivity and labelling "future", please re-open if you plan to work on this.