tskit Average arity of nodes in the trees

trafficstars

It's useful to have a summary of the average arity of nodes in the trees. Here's a function that computes it:

def nodes_mean_arity(ts):
    weighted_arity = np.zeros(ts.num_nodes)
    span_in_tree = np.zeros(ts.num_nodes)
    for tree in ts.trees():
        weighted_arity += tree.num_children_array[:-1] * tree.span
        span_in_tree[tree.num_children_array[:-1] > 0] += tree.span
    span_in_tree[ts.samples()] = ts.sequence_length
    weighted_arity /= span_in_tree
    return weighted_arity

We could add this as a function to the TreeSequence? (The naming is chosen to follow the arrays like nodes_time etc)

We can imagine having different options for weighting, at some point.

To do it really efficiently, we'd need to have to implement in C with an edge-diffs approach, but this implementation is probably quite fast for most purposes.

Although, I wonder if there a way of expressing the operation as a node stat?

Mar 08 '23 11:03 jeromekelleher

There's a more efficient implementation by simply adding up all the edge spans per parent node, and dividing by the span for which the parent node is in the tree:

    span_sums = np.zeros(ts.num_nodes)
    # Find the edge indices where parents change
    i = np.insert(np.nonzero(np.diff(ts.edges_parent))[0] + 1, 0, 0)
    span_sums[ts.edges_parent[i]] = np.add.reduceat(ts.edges_right - ts.edges_left, i)
    # Now divide by the node spans

This is insanely quick, but will be slowed down by the node span calculations, which need to be done using edge diffs. Here's the node span implementation that I thought might be generally useful (see https://github.com/tskit-dev/tskit/discussions/2718):

def node_spans_parent(ts, min_num_children=None):
    """
    Returns the spans over which each node is a parent of greater or equal to
    min_num_children children. 
    
    Note - to find the "general node spans", i.e. the spans over which nodes are present in all local
      trees, you may wish to include cases where nodes are children as well as parents. If all leaf
      nodes in your trees are sample nodes, as is usually the case, the general node spans can simply
      be found by setting sample node spans to the total sequence length (as by definition, samples
      are present in all trees in the tree sequence). This can be done as follows
      
          node_span = node_spans_parent(ts)
          node_span[ts.samples()] = ts.sequence_length
    
    
    Note - if you wish to ignore regions over which a node is a unary node, you can set 
        min_num_children = 2
    
    Note - this also counts "dead" branches
    """
    if min_num_children is None:
        min_num_children = 1
    num_children = np.zeros(ts.num_nodes, dtype=np.int32)
    span_start = np.zeros(ts.num_nodes)
    node_span = np.zeros(ts.num_nodes)

    for interval, edges_out, edges_in in ts.edge_diffs(include_terminal=True):
        touched=set()
        for edge in edges_out:
            num_children[edge.parent] -= 1
            if num_children[edge.parent] == min_num_children - 1:
            # Should no longer count this node
                node_span[edge.parent] += interval.left - span_start[edge.parent]

        for edge in edges_in:
            num_children[edge.parent] += 1
            if num_children[edge.parent] == min_num_children:
                span_start[edge.parent] = interval.left
    
    return node_span

Mar 08 '23 17:03 hyanwong

I believe you can also compute node spans as

node_spans = ts.sample_count_stat([ts.samples()], lambda x: (x > 0), 1, polarised=True, span_normalise=False, strict=False, mode='node')

Edit: I got polarised wrong the first time - should be True.

Mar 08 '23 17:03 petrelharp

This solution is very nice, @hyanwong. I think you can get span_sums as follows, somewhat more simply:

span_sums = np.bincount(ts.edges_parent, weights=ts.edges_right - ts.edges_left, minlength=ts.num_nodes)

Mar 08 '23 17:03 petrelharp

So, that'd be

def nodes_mean_arity2(ts):
    span_sums = np.bincount(ts.edges_parent, weights=ts.edges_right - ts.edges_left, minlength=ts.num_nodes)
    node_spans = ts.sample_count_stat([ts.samples()], lambda x: (x > 0), 1, polarised=True, span_normalise=False, strict=False, mode='node')[:,0]
    return span_sums / node_spans

... which seems to agree with Jerome's implementation!

Mar 08 '23 17:03 petrelharp

I believe you can also compute node spans ...

Oh, brilliant. That's a really neat thing to know. It would be fantastic to either document this node span one-liner somewhere, or add a simple wrapper function to return it?

Mar 08 '23 19:03 hyanwong

node_spans = ts.sample_count_stat ...

However, in a basic test, running @petrelharp's sample_count_stat approach on the large Covid tree seq (1.4M nodes, 1496 trees) took 10 mins on my laptop, whereas running the node_spans_parent function that I showed above took 21 seconds. So it might be worth having something like this in tskit anyway? I'm not sure if anything could be sped up e.g. using numba, but I think not, and also suspect that the sample_count_stat might already be quite optimised?

However, the bincount approach is about twice as fast on the covid ARG as my np.add.reduceat approach, although it's trivial in terms of the total time for the calc (6 milliseconds vs 12 milliseconds, on average!)

Mar 08 '23 20:03 hyanwong

What would be nice is the ability to slice the arity stat to average over genome region and/or timeslices, which would put it more into the stats framework, right?

Mar 08 '23 20:03 hyanwong

Super neat!

That won't be very efficient as it's calling back to python, but we could easily implement the node spans in C using this approach

Mar 08 '23 21:03 jeromekelleher

took 10 mins on my laptop, whereas running the node_spans_parent function that I showed above took 21 seconds.

Huh, that's confusing! I suppose it might possibly be due to the C code calling back out to the python summary function?!? Otherwise I don't see why that code should take a long time; we've not got high-dimesional input or output...

I do think this should be a method - there's other use cases for it (maybe there's even an issue for this already...)

Mar 08 '23 23:03 petrelharp

Huh, that's confusing!

Perhaps you could check on your machine, with a big msprime-generated tree sequence, Peter?

Mar 08 '23 23:03 hyanwong

Confusing! I suppose it might possibly be due to the C code calling back out to the python summary function?!? Otherwise I don't see why that code should take a long time; we've not got high-dimesional input or output...

This is definitely because the summary func is calling back to Python - I wouldn't expect this to be efficient, or even indicative of real performance. We would need to implement in C - but it's easy.

I agree we should make this stuff available as API methods, it's just a question of how to break things up and package.

Mar 09 '23 11:03 jeromekelleher

I'm happy to do that; I think the main question is what to call the function; weigh in over at #2725.

Mar 13 '23 17:03 petrelharp

Closing for inactivity and labelling "future", please re-open if you plan to work on this.

Jun 12 '25 22:06 benjeffery

tskit tskit copied to clipboard

Average arity of nodes in the trees

tskit
tskit copied to clipboard