bottleneck
Improved slow sum-of-squares
It looks like a job for np.einsum
def ss(a, axis=None):
    "Slow sum of squares used for unaccelerated dtypes."
    a = np.asarray(a)
    input_subscripts = range(a.ndim)
    if axis is not None:
        output_subscripts = list(set(input_subscripts) - set([axis]))
    else:
        output_subscripts = []
    y = np.einsum(a, input_subscripts, a, input_subscripts, output_subscripts)
    return y
This does not handle NaNs, but neither does the current version. Actually, I have to admit that, on my machine, this version is twice as fast as the accelerated version on 1000x1000 arrays, though I do not know how general that result is.
Wow! That is amazing.
The current version is just:
def ss(a, axis=None):
    "Slow sum of squares used for unaccelerated dtypes."
    a = np.asarray(a)
    y = np.multiply(a, a).sum(axis)
    return y
The slow version serves several purposes: unit testing (I prefer the current simple one for that); benchmarking (I assume, but could be wrong, that most users would do something like a*a as in the current version); and handling unsupported dtypes (rare).
The bottleneck version is much faster for small arrays. For large arrays I wonder if there is C API access to einsum (not that I would code it up). At the least I should add a note in the docstring that np.einsum is much faster for large 2d arrays.
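For reference, the two formulations compute the same thing; a quick equivalence check (just a sketch, not part of the patch):

```python
import numpy as np

a = np.random.rand(5, 7)

# einsum form: repeating the input subscripts multiplies the array by
# itself elementwise; dropping an index from the output sums over it.
assert np.allclose(np.einsum(a, [0, 1], a, [0, 1], [1]),
                   np.multiply(a, a).sum(axis=0))

# An empty output subscript list reduces over every axis at once.
assert np.allclose(np.einsum(a, [0, 1], a, [0, 1], []),
                   np.multiply(a, a).sum())
```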
Alternatively we could use your version. It currently doesn't handle negative axes.
This seems to work (passes all bottleneck unit tests):
def ss(a, axis=None):
    "Slow sum of squares used for unaccelerated dtypes."
    a = np.asarray(a)
    input_subscripts = range(a.ndim)
    if axis is None:
        output_subscripts = []
    else:
        if axis < 0:
            axis = a.ndim + axis
        if axis < 0:
            raise ValueError("`axis` is out of range")
        output_subscripts = list(set(input_subscripts) - set([axis]))
    y = np.einsum(a, input_subscripts, a, input_subscripts, output_subscripts)
    return y
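To confirm the negative-axis handling, a self-contained sanity check (the function restated here only for the sketch):

```python
import numpy as np

def ss(a, axis=None):
    """Sum of squares via np.einsum; accepts negative axes (sketch)."""
    a = np.asarray(a)
    input_subscripts = list(range(a.ndim))
    if axis is None:
        output_subscripts = []
    else:
        if axis < 0:
            axis = a.ndim + axis
        if axis < 0:
            raise ValueError("`axis` is out of range")
        output_subscripts = list(set(input_subscripts) - set([axis]))
    return np.einsum(a, input_subscripts, a, input_subscripts, output_subscripts)

a = np.arange(6.0).reshape(2, 3)
assert np.allclose(ss(a, -1), (a * a).sum(axis=1))  # axis=-1 same as axis=1
assert np.allclose(ss(a, -2), (a * a).sum(axis=0))
assert ss(a) == (a * a).sum()
```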
Great! What about nansum? We have to handle NaNs by copying the array.
def nansum(a, axis=None):
    "Slow nansum function used for unaccelerated dtype."
    a = np.asarray(a)
    mask = np.isnan(a)
    if mask.any():
        a = a.copy()
        a[mask] = 0
    input_subscripts = range(a.ndim)
    if axis is None:
        y = np.einsum(a, input_subscripts)
    else:
        if axis < 0:
            axis = a.ndim + axis
        if axis < 0:
            raise ValueError("`axis` is out of range")
        output_subscripts = list(set(input_subscripts) - set([axis]))
        y = np.einsum(a, input_subscripts, output_subscripts)
    return y
Please note that I handled the "raveled" case differently. This is because I noticed a severe performance drop when passing an empty list as the output subscripts. Here are some timings on my machine:
In [1]: import bottleneck, numpy as np
In [2]: %cpaste
Pasting code; enter '--' alone on the line to stop or use Ctrl-D.
:def nansum(a, axis=None):
:    "Slow nansum function used for unaccelerated dtype."
:    a = np.asarray(a)
:    mask = np.isnan(a)
:    if mask.any():
:        a = a.copy()
:        a[mask] = 0
:    input_subscripts = range(a.ndim)
:    if axis is None:
:        y = np.einsum(a, input_subscripts)
:    else:
:        if axis < 0:
:            axis = a.ndim + axis
:        if axis < 0:
:            raise ValueError("`axis` is out of range")
:        output_subscripts = list(set(input_subscripts) - set([axis]))
:        y = np.einsum(a, input_subscripts, output_subscripts)
:    return y
:<EOF>
In [3]: a = np.random.rand(1000,1000)
In [4]: %timeit nansum(a)
1000 loops, best of 3: 460 µs per loop
In [5]: %timeit nansum(a, 0)
1000 loops, best of 3: 863 µs per loop
In [6]: %timeit nansum(a, 1)
1000 loops, best of 3: 890 µs per loop
In [7]: %timeit bottleneck.nansum(a)
1000 loops, best of 3: 876 µs per loop
In [8]: %timeit bottleneck.nansum(a, 0)
1000 loops, best of 3: 1.1 ms per loop
In [9]: %timeit bottleneck.nansum(a, 1)
1000 loops, best of 3: 882 µs per loop
In [10]: a[30,40] = np.nan
In [11]: %timeit nansum(a)
1000 loops, best of 3: 1.32 ms per loop
In [12]: %timeit nansum(a, 0)
100 loops, best of 3: 1.91 ms per loop
In [13]: %timeit nansum(a, 1)
100 loops, best of 3: 1.83 ms per loop
In [14]: %timeit bottleneck.nansum(a)
1000 loops, best of 3: 875 µs per loop
In [15]: %timeit bottleneck.nansum(a, 0)
1000 loops, best of 3: 1.1 ms per loop
In [16]: %timeit bottleneck.nansum(a, 1)
1000 loops, best of 3: 883 µs per loop
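One caveat about the raveled branch above, at least with current NumPy (this may postdate the discussion): single-operand einsum in implicit mode keeps every index that appears exactly once, so np.einsum(a, input_subscripts) returns the array unchanged rather than its sum; an explicit empty output list is what forces the full reduction. A minimal demonstration:

```python
import numpy as np

a = np.arange(6.0).reshape(2, 3)

# Implicit mode: indices appearing exactly once are kept in the output,
# so this returns the array itself, not a sum.
assert np.einsum(a, [0, 1]).shape == (2, 3)

# An explicit empty output list performs the full reduction.
assert np.einsum(a, [0, 1], []) == 15.0
```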
I think the case for changing bn.slow.nansum is weaker than the case for bn.slow.ss. For example, many users of nansum will know ahead of time whether the array contains NaNs.
How does the speed of your einsum-based nansum compare to bn.slow.nansum which is just np.nansum?
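One way to check (a sketch; nansum_einsum is a hypothetical name for a variant of the einsum approach that uses an explicit empty output list, and timings will vary by machine and NumPy version):

```python
import timeit
import numpy as np

def nansum_einsum(a, axis=None):
    """NaN-ignoring sum via np.einsum (sketch; non-negative axis only)."""
    a = np.asarray(a)
    mask = np.isnan(a)
    if mask.any():
        a = a.copy()
        a[mask] = 0
    subs = list(range(a.ndim))
    out = [] if axis is None else [i for i in subs if i != axis]
    return np.einsum(a, subs, out)

a = np.random.rand(1000, 1000)
a[30, 40] = np.nan
assert np.allclose(nansum_einsum(a), np.nansum(a))

for label, func in [("np.nansum", np.nansum), ("einsum", nansum_einsum)]:
    t = min(timeit.repeat(lambda: func(a), number=10, repeat=3)) / 10
    print("%s: %.3f ms" % (label, t * 1e3))
```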