tskit icon indicating copy to clipboard operation
tskit copied to clipboard

Deprecate the (new) _array methods by allowing both parent(u) and parent[u] access?

Open hyanwong opened this issue 3 years ago • 11 comments
trafficstars

In #1320 we bemoaned the introduction of new Tree.parent_array[u], Tree.left_sib_array[u], etc. notation, which are the array-based (direct memory access) equivalent of Tree.parent(u), Tree.left_sib(u), etc. Nevertheless, this made it into tskit 0.3.6.

However, it seems like we can use the __call__ dunder method to determine what happens when we call class_instance(u). So we could have both Tree.parent(u) and Tree.parent[u]as valid access methods, by creating an overloaded numpy ndarray class, and returning a numpy array using this as a view:

class Ts_nparray(np.ndarray):
    # see https://numpy.org/doc/stable/user/basics.subclassing.html
    def __call__(self, index):
        return self[index]

a = np.arange(10, 20).view(Ts_nparray)
assert a(3) == a[3]

If this doesn't seem too hacky, we could thus have

class _ts_nparray(np.ndarray):
    # a simple wrapper class to allow the array to be accessed using arr(x) as well as arr[x]
    # see https://numpy.org/doc/stable/user/basics.subclassing.html
    def __call__(self, index):
        return self[index]  # could also convert this to a python int, for full backwards compatibility


class Tree
    ...
    @property 
    def parent(self):
        # get the numpy direct memory array here
        return numpy_direct_memory_array.view(_ts_nparray)

    @property 
    def parent_array(self):
        # Deprecated but could be maintained for backwards compatibility
        return self.parent()

That would mean all the following are equivalent:

tree.parent[0]  # Standard numpy access, recommended for new code
tree.parent(0)  # Old tskit syntax, deprecated but permanently maintained for backwards compatibility
tree.parent_array[0] # Recently introduced syntax (in 0.3.6), make deprecated, possibly could be removed later
tree.parent_array(0)  # Comes along for the ride(!)

And we would strongly recommend that new code uses the first (standard numpy) method. Using __call__ to keep the old behaviour too seems a bit hacky, but would mean that old code would (hopefully) switch seamlessly to the new.

hyanwong avatar Dec 05 '21 13:12 hyanwong

It might be worth some perf testing. If doing it this way makes Tree.parent(u) actually faster, then that seems a good reason to me to consider it.

hyanwong avatar Dec 05 '21 14:12 hyanwong

Great idea @hyanwong, very clever! I doubt many people are using Tree.parent_array, so we could consider just nuking it.

However, we do need to be careful with performance here. Counterintuitively, doing tree.parent(u) lots of times is faster than tree.parent_array[u], basically because the __getitem__ code path for a numpy array is very complicated. So, we'd want to keep the existing linkage with the _ll_tree methods.

We could do something like

class QuintuplyLInkedTreeArray(np.ndarray):
     
      def __call__(self, index):
           return self._get_method(index)

class Tree

    def __init__(self, ll_tree):
         ....
         self._parent = self.ll_tree.parent_array.view(QuintuplyLInkedTreeArray)
         self._parent._get_method = ll_tree.get_parent
         # Ideally we'd pass the get_method in with the contstructor, but not sure if that's possible.

    @property   
    def parent(self):
        return self._parent

This should be fully compatible and have no loss of perf then?

jeromekelleher avatar Dec 05 '21 15:12 jeromekelleher

However, we do need to be careful with performance here. Counterintuitively, doing tree.parent(u) lots of times is faster than tree.parent_array[u], basically because the __getitem__ code path for a numpy array is very complicated.

So I guess we want to say "if you are accessing elements of the array, you might want to use round braces for speed, rather than access using .parent[u]. Then we actually keep the round braces as the standard way of doing it? But have the advantage that we can index into the array using the square bracket syntax e.g. using a boolean array?

hyanwong avatar Dec 05 '21 15:12 hyanwong

It probably wouldn't matter the vast majority of the time, but yes, we'd end up doing this I'd say.

jeromekelleher avatar Dec 05 '21 15:12 jeromekelleher

Another thing we need to check as well before going through with this - does the subclassed numpy array still work well with numba? Can it get direct memory access from a view, or do we end up invoking the Python interpreter in order to unravel things?

jeromekelleher avatar Dec 05 '21 15:12 jeromekelleher

Taking this out of 0.4.0 for now, as I don't think we want to rush into it. The cat's out of the bag for the parent_array etc, so let's take it slowly and make sure this works as expected in high-performance cases.

jeromekelleher avatar Dec 05 '21 16:12 jeromekelleher

Here's a bit of a test for numba, based on one of your previous examples. It seems not to make much difference adding the __call__ method to the view, but I haven't tested other permutations

import msprime
import numba
import numpy as np
import tskit

ts1m = msprime.sim_ancestry(1e6, ploidy=1, random_seed=42)

ts = ts1m

class QuintuplyLinkedTreeArray(np.ndarray):
      def __call__(self, index):
           return self._get_method(index)


tree = ts.first()

time = ts.tables.nodes.time


@numba.njit()
def total_branch_length_numba(root):
    tbl = 0
    stack = [root]
    while len(stack) > 0:
        u = stack.pop()
        v = left_child[u]
        while v != tskit.NULL:
            tbl += time[u] - time[v]
            stack.append(v)
            v = right_sib[v]
    return tbl

@numba.njit()
def total_branch_length_numba_recursive(u):
    tbl = 0
    v = left_child[u]
    while v != tskit.NULL:
        tbl += (time[u] - time[v]) + total_branch_length_numba_recursive(v)
        v = right_sib[v]
    return tbl
Screenshot 2021-12-05 at 18 04 04

hyanwong avatar Dec 05 '21 18:12 hyanwong

It's not obvious that this will do what it looks like it should do, numba could be doing something odd with the references.

But, it looks good and I think we should make the change once we've done the due diligence in making sure we're not dropping any performance.

Let's pick this up again once 0.4.0 is released and we're thinking about numba on arrays again for the paper.

jeromekelleher avatar Dec 06 '21 08:12 jeromekelleher

Interesting, thanks @hyanwong will do a proper think about this after 0.4.0

benjeffery avatar Dec 06 '21 12:12 benjeffery

I wonder if it is time to revisit this, before people start using Tree.parent_array in earnest?

hyanwong avatar Mar 08 '22 23:03 hyanwong

I'm not particularly motivated tbh, there's a lot of critical path stuff we need to get done for the tskit paper first.

jeromekelleher avatar Mar 09 '22 08:03 jeromekelleher