
Improved slow sum-of-squares

Open corrado9999 opened this issue 6 years ago • 6 comments

It looks like a job for np.einsum

def ss(a, axis=None):
    "Slow sum of squares used for unaccelerated dtypes."
    a = np.asarray(a)
    input_subscripts = range(a.ndim)
    # Repeating `a` with the same subscripts squares it elementwise;
    # dropping `axis` from the output subscripts sums over that axis,
    # and an empty output list sums over all axes.
    output_subscripts = (list(set(input_subscripts) - set([axis]))
                         if axis is not None
                         else [])
    y = np.einsum(a, input_subscripts, a, input_subscripts, output_subscripts)
    return y
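
For reference, this relies on einsum's sublist form: repeating the operand with the same subscripts multiplies it elementwise, and omitting an axis from the explicit output subscripts sums over it. A minimal sanity check (my example, not from the original post):

```python
import numpy as np

a = np.random.rand(4, 5)

# Empty output subscript list: sum of squares over all axes.
full = np.einsum(a, [0, 1], a, [0, 1], [])

# Keep only subscript 1 in the output: sum of squares over axis 0.
per_col = np.einsum(a, [0, 1], a, [0, 1], [1])

assert np.allclose(full, (a * a).sum())
assert np.allclose(per_col, (a * a).sum(axis=0))
```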

This does not handle NaNs, but neither does the current version. I have to admit that, on my machine, this version is twice as fast as the accelerated version on 1000x1000 arrays, though I do not know how general that result is.

corrado9999 avatar Apr 05 '18 16:04 corrado9999

Wow! That is amazing.

The current version is just:

def ss(a, axis=None):
    "Slow sum of squares used for unaccelerated dtypes."
    a = np.asarray(a)
    y = np.multiply(a, a).sum(axis)
    return y

The slow version serves several purposes: unit testing (I prefer the current simple version for that); benchmarking (I assume, though I could be wrong, that most users would do something like a*a as in the current version); and handling unsupported dtypes (rare).

The bottleneck version is much faster for small arrays. For large arrays I wonder if there is C API access to einsum (not that I would code it up). At the least I should note in the docstring that np.einsum is much faster for large 2d arrays.

kwgoodman avatar Apr 08 '18 23:04 kwgoodman

Alternatively we could use your version. It currently doesn't handle negative axes.
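
A quick way to see the failure mode (a hypothetical check, lightly adapted from the version above; `ss_einsum` is my name for it): a negative axis never matches any subscript, so nothing gets summed and the result silently keeps every axis.

```python
import numpy as np

def ss_einsum(a, axis=None):
    # einsum-based version without negative-axis handling
    a = np.asarray(a)
    input_subscripts = list(range(a.ndim))
    output_subscripts = (sorted(set(input_subscripts) - set([axis]))
                         if axis is not None else [])
    return np.einsum(a, input_subscripts, a, input_subscripts, output_subscripts)

a = np.arange(6.0).reshape(2, 3)
print((a * a).sum(axis=-1).ndim)  # 1: the expected, reduced result
print(ss_einsum(a, -1).ndim)      # 2: -1 matched no subscript, so no axis was summed
```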

kwgoodman avatar Apr 10 '18 00:04 kwgoodman

This seems to work (passes all bottleneck unit tests):

def ss(a, axis=None):
    "Slow sum of squares used for unaccelerated dtypes."
    a = np.asarray(a)
    input_subscripts = range(a.ndim)
    if axis is None:
        output_subscripts = []
    else:
        if axis < 0:
            axis = a.ndim + axis
            if axis < 0:
                raise ValueError("`axis` is out of range")
        output_subscripts = list(set(input_subscripts) - set([axis]))
    y = np.einsum(a, input_subscripts, a, input_subscripts, output_subscripts)
    return y
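
A quick check (my addition, with the function repeated so it runs standalone, and sorted() used for a deterministic output order) that the normalization makes negative axes behave like NumPy's:

```python
import numpy as np

def ss(a, axis=None):
    "Slow sum of squares used for unaccelerated dtypes."
    a = np.asarray(a)
    input_subscripts = list(range(a.ndim))
    if axis is None:
        output_subscripts = []
    else:
        if axis < 0:
            axis = a.ndim + axis
            if axis < 0:
                raise ValueError("`axis` is out of range")
        output_subscripts = sorted(set(input_subscripts) - set([axis]))
    return np.einsum(a, input_subscripts, a, input_subscripts, output_subscripts)

a = np.random.rand(3, 4)
assert np.allclose(ss(a, -1), (a * a).sum(axis=-1))
assert np.allclose(ss(a, -2), (a * a).sum(axis=-2))
assert np.allclose(ss(a), (a * a).sum())

try:
    ss(a, -3)  # out of range for a 2d array
except ValueError:
    print("raised ValueError as expected")
```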

kwgoodman avatar Apr 10 '18 00:04 kwgoodman

Great! What about nansum? We have to handle NaNs by copying the array.

def nansum(a, axis=None):
    "Slow nansum function used for unaccelerated dtype."
    a = np.asarray(a)
    mask = np.isnan(a)
    if mask.any():
        a = a.copy()
        a[mask] = 0
    input_subscripts = range(a.ndim)
    if axis is None:
        y = np.einsum(a, input_subscripts)
    else:
        if axis < 0:
            axis = a.ndim + axis
            if axis < 0:
                raise ValueError("`axis` is out of range")
        output_subscripts = list(set(input_subscripts) - set([axis]))
        y = np.einsum(a, input_subscripts, output_subscripts)
    return y
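
One caveat worth flagging (my note, not from the original comment): with a single operand and no output sublist, np.einsum runs in implicit mode, where subscripts appearing only once are all kept, so np.einsum(a, [0, 1]) returns a unchanged rather than its sum. A sketch with an explicit empty output list in the axis=None branch, cross-checked against np.nansum (`nansum_einsum` is a hypothetical name):

```python
import numpy as np

def nansum_einsum(a, axis=None):
    "einsum-based nansum with an explicit output list in every branch."
    a = np.asarray(a)
    mask = np.isnan(a)
    if mask.any():
        a = a.copy()       # avoid mutating the caller's array
        a[mask] = 0
    subs = list(range(a.ndim))
    if axis is None:
        # explicit empty output list -> reduce over every axis
        return np.einsum(a, subs, [])
    if axis < 0:
        axis += a.ndim
        if axis < 0:
            raise ValueError("`axis` is out of range")
    return np.einsum(a, subs, sorted(set(subs) - set([axis])))

a = np.random.rand(5, 6)
a[1, 2] = np.nan

assert np.allclose(nansum_einsum(a), np.nansum(a))
assert np.allclose(nansum_einsum(a, 0), np.nansum(a, axis=0))
assert np.allclose(nansum_einsum(a, -1), np.nansum(a, axis=-1))
```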

Please note that I handled the "raveled" (axis=None) case differently: I noticed a severe performance drop when passing an empty list as the output subscripts. Here are some timings on my machine:

In [1]: import bottleneck, numpy as np

In [2]: %cpaste
Pasting code; enter '--' alone on the line to stop or use Ctrl-D.
:def nansum(a, axis=None):
:    "Slow nansum function used for unaccelerated dtype."
:    a = np.asarray(a)
:    mask = np.isnan(a)
:    if mask.any():
:        a = a.copy()
:        a[mask] = 0
:    input_subscripts = range(a.ndim)
:    if axis is None:
:        y = np.einsum(a, input_subscripts)
:    else:
:        if axis < 0:
:            axis = a.ndim + axis
:            if axis < 0:
:                raise ValueError("`axis` is out of range")
:        output_subscripts = list(set(input_subscripts) - set([axis]))
:        y = np.einsum(a, input_subscripts, output_subscripts)
:    return y
:<EOF>

In [3]: a = np.random.rand(1000,1000)

In [4]: %timeit nansum(a)
1000 loops, best of 3: 460 µs per loop

In [5]: %timeit nansum(a, 0)
1000 loops, best of 3: 863 µs per loop

In [6]: %timeit nansum(a, 1)
1000 loops, best of 3: 890 µs per loop

In [7]: %timeit bottleneck.nansum(a)
1000 loops, best of 3: 876 µs per loop

In [8]: %timeit bottleneck.nansum(a, 0)
1000 loops, best of 3: 1.1 ms per loop

In [9]: %timeit bottleneck.nansum(a, 1)
1000 loops, best of 3: 882 µs per loop

In [10]: a[30,40] = np.nan

In [11]: %timeit nansum(a)
1000 loops, best of 3: 1.32 ms per loop

In [12]: %timeit nansum(a, 0)
100 loops, best of 3: 1.91 ms per loop

In [13]: %timeit nansum(a, 1)
100 loops, best of 3: 1.83 ms per loop

In [14]: %timeit bottleneck.nansum(a)
1000 loops, best of 3: 875 µs per loop

In [15]: %timeit bottleneck.nansum(a, 0)
1000 loops, best of 3: 1.1 ms per loop

In [16]: %timeit bottleneck.nansum(a, 1)
1000 loops, best of 3: 883 µs per loop

corrado9999 avatar Apr 10 '18 09:04 corrado9999

I think the case for changing bn.slow.nansum is weaker than that for bn.slow.ss. For example, many users of nansum will know ahead of time if the array contains nans.

kwgoodman avatar Apr 10 '18 20:04 kwgoodman

How does the speed of your einsum-based nansum compare to bn.slow.nansum which is just np.nansum?
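
One way to answer that empirically (a sketch I have not tuned; timings are machine-dependent, and `nansum_einsum` is my name for the einsum-based variant from this thread, here with an explicit empty output list in the axis=None branch):

```python
import timeit
import numpy as np

def nansum_einsum(a, axis=None):
    # einsum-based nansum variant from this thread
    a = np.asarray(a)
    mask = np.isnan(a)
    if mask.any():
        a = a.copy()
        a[mask] = 0
    subs = list(range(a.ndim))
    if axis is None:
        return np.einsum(a, subs, [])
    if axis < 0:
        axis += a.ndim
        if axis < 0:
            raise ValueError("`axis` is out of range")
    return np.einsum(a, subs, sorted(set(subs) - set([axis])))

a = np.random.rand(1000, 1000)
a[30, 40] = np.nan

for label, fn in [("np.nansum ", lambda: np.nansum(a, axis=0)),
                  ("einsum    ", lambda: nansum_einsum(a, axis=0))]:
    # best of 3 runs of 20 calls each, reported per call
    t = min(timeit.repeat(fn, number=20, repeat=3)) / 20
    print("%s %.3f ms" % (label, t * 1e3))
```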

kwgoodman avatar Apr 12 '18 16:04 kwgoodman