bottleneck icon indicating copy to clipboard operation
bottleneck copied to clipboard

Possible to implement a Centered moving average?

Open CP-Vub opened this issue 6 years ago • 9 comments

Dear Devs, I have used your module in the past with great success in speeding up my code. However, I wonder if it is possible to implement or add perhaps a Boolean to the moving mean function that lets the user specify whether he wants a centered average or not, similar to: pandas.DataFrame.rolling(window=width,center=True).mean() Currently I am still using pandas for central moving averages but it is significantly slower than Bottlenecks functions unfortunately.

CP-Vub avatar Mar 15 '18 06:03 CP-Vub

I am happy to hear that bottleneck has been useful. You made my day.

I doubt I will be making your day because I am going to suggest a workaround which might not work for you. Could you shift the output to get what you want?

kwgoodman avatar Mar 15 '18 16:03 kwgoodman

We implemented the shifting solution in xarray to do centered moving averages on top of bottleneck. Try xarray.DataArray.rolling(..., center=True).mean()

shoyer avatar Mar 15 '18 17:03 shoyer

Oh! Nice.

kwgoodman avatar Mar 15 '18 17:03 kwgoodman

Here's the implementation (by @fujiisoup) for anyone that's curious: https://github.com/pydata/xarray/blob/e1dc51572e971567fd3562db0e9f662e3de80898/xarray/core/rolling.py#L281

shoyer avatar Mar 15 '18 17:03 shoyer

@kwgoodman Well bottleneck has reduced my algorithm's calculation times by quite a significant factor (approx an order of magnitude). For my current application, I'm calculating a moving mean over a couple of billion values, and even though pandas is not slow, bottleneck still manages to save me quite some time. So I can honestly say that you made my day when I discovered your module!

@shoyer Thank you for the suggestion, now I know about the existence of xarray as well! I also tried out the suggested method and compared it to bottleneck and pandas using some quickly written example code. However, I have the feeling the bottleneck module is not being recognized by xarray, since the computation times are way higher than for the bottleneck module. I've added the code I used to check this .

  • Code:

import numpy as np import pandas as pd import time import bottleneck as bn import xarray import matplotlib.pyplot as plt

N = 30000200 # Number of datapoints Fs = 30000 # sample rate T=1/Fs # sample period duration = N/Fs # duration in s t = np.arange(0,duration,T) # time vector DATA = np.random.randn(N,)+5np.sin(2np.pi0.01t) # Example noisy sine data and window size w = 330000

def using_bottleneck_mean(data,width): return bn.move_mean(a=data,window=width,min_count = 1)

def using_pandas_rolling_mean(data,width): return np.asarray(pd.DataFrame(data).rolling(window=width,center=True,min_periods=1).mean()).ravel()

def using_xarray_mean(data,width): return xarray.DataArray(data,dims='x').rolling(x=width,min_periods=1, center=True).mean()

start=time.time() A = using_bottleneck_mean(DATA,w) print('Bottleneck: ', time.time()-start, 's') start=time.time() B = using_pandas_rolling_mean(DATA,w) print('Pandas: ',time.time()-start,'s') start=time.time() C = using_xarray_mean(DATA,w) print('Xarray: ',time.time()-start,'s')

  • Output for a single iteration, time taken: Bottleneck: 0.033s Pandas: 0.255s Xarray: 5.217s

CP-Vub avatar Mar 15 '18 19:03 CP-Vub

@RayPalmerTech thanks for trying it out, and for the reproducible example! We will definitely add something like it to our benchmark suite.

Indeed this is way slower than it should be. With a simple fix (see https://github.com/pydata/xarray/issues/1993 for details) I get much more reasonable performance:

Bottleneck: 0.06775331497192383 s Pandas: 0.48262882232666016 s Xarray: 0.1723031997680664 s

shoyer avatar Mar 15 '18 20:03 shoyer

@RayPalmerTech if you don't need the output to be the same length as the input you can slice into the output, something like a[half_window:-half_window]. Slicing is very fast as there are no copies made.

kwgoodman avatar Mar 16 '18 22:03 kwgoodman

A workaround is to pad the array with NaNs in the filtering axis on the right side and remove the excess elements on the left side to emulate symmetric filtering. This works when no NaNs are expected in the original array. There is a ~30% overhead for it, but it's still blazingly fast compared to numpy or even scipy's medfilt2d and I'm able to replicate Matlab's medfilt1 with the 'truncate' option (only 'numpy' and 'bottleneck pad' give the same result). Here the times for a 10000 x 200 array and move_median with size 15:

Bottleneck: 0.106s
Bottleneck (pad): 0.134s
Scipy: 0.472s
Numpy: 1.033s

And here the code I used to generate the results:

import time
import numpy as np
from scipy.signal import medfilt2d
import bottleneck as bk

def _medfilt1_np(array, size):
    lsize = size // 2
    rsize = size - lsize
    return np.stack([np.median(array[:, max(i - lsize, 0):i + rsize], axis=1)
                     for i in range(array.shape[1])]).T

def _medfilt1_bk(array, size):
    rsize = size - size // 2 - 1
    array = np.pad(array, pad_width=((0, 0), (0, rsize)),
                   mode='constant', constant_values=np.nan)
    return bk.move_median(array, size, min_count=1, axis=1)[:, rsize:]

test = np.random.randn(10000, 200)

start = time.time()
res = bk.move_median(test, 15, min_count=1, axis=1)
print('Bottleneck: {:.4}s'.format(time.time() - start))

start = time.time()
res = _medfilt1_bk(test, 15)
print('Bottleneck (pad): {:.4}s'.format(time.time() - start))

start = time.time()
res = medfilt2d(test, (1, 15))
print('Scipy: {:.4}s'.format(time.time() - start))

start = time.time()
res = _medfilt1_np(test, 15)
print('Numpy: {:.4}s'.format(time.time() - start))

Banus avatar Jun 05 '20 18:06 Banus

Hi Mr Goodman,

I am relative new and inexperienced with python an came across your bottleneck Documentation, Release 1.4.0.dev0+102.g933b653 whilst searching for a fast moving average routine.

Is it OK for people to use your code freely.

Regards

CharalapML avatar Dec 22 '20 16:12 CharalapML