bottleneck
Inconsistencies in nansum for float32 dtype compared to numpy
Consider this simple example:
In [1]: import numpy as np
In [2]: import bottleneck as bn
In [3]: data = 2e5*np.random.rand(int(4e7)).astype('float32')
In [4]: np.nansum(data)
Out[4]: 4000034300000.0
In [5]: bn.nansum(data)
Out[5]: 3719060258816.0
Looks like errors in the computation are compounding due to loss of precision, since the problem becomes much less apparent for smaller datasets. Repeating the above with the float64 dtype gives me much more consistent results:
In [6]: bn.nansum(data.astype('float64'))
Out[6]: 4000035580557.9033
In [7]: np.nansum(data.astype('float64'))
Out[7]: 4000035580557.979
I tested this example with bottleneck 1.1.0 and 1.2.1.
I think numpy uses a more robust algorithm, pairwise summation (https://en.wikipedia.org/wiki/Pairwise_summation); bottleneck doesn't.
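To illustrate the difference, here is a minimal sketch of pairwise summation in pure Python (the block size of 128 and the helper names are my own choices, not bottleneck or numpy internals), compared against a naive left-to-right loop on float32 data similar to the example above:

```python
import numpy as np

def naive_sum(a):
    """Left-to-right accumulation in the array's own dtype (float32 here).
    Rounding error grows roughly linearly with the number of elements."""
    total = a.dtype.type(0)
    for x in a:
        total += x
    return total

def pairwise_sum(a, block=128):
    """Recursive pairwise summation: split the array in half, sum each half,
    then add the two partial sums. Error grows only logarithmically."""
    n = len(a)
    if n <= block:
        # Small base case: a plain loop over a short run is accurate enough.
        return naive_sum(a)
    mid = n // 2
    return pairwise_sum(a[:mid], block) + pairwise_sum(a[mid:], block)

rng = np.random.default_rng(0)
data = (2e5 * rng.random(100_000)).astype('float32')
exact = data.astype('float64').sum()  # double-precision reference

naive_err = abs(float(naive_sum(data)) - exact)
pairwise_err = abs(float(pairwise_sum(data)) - exact)
print(naive_err, pairwise_err)
```

On data like this, the pairwise result lands far closer to the float64 reference than the naive loop, even though both accumulate in float32.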
Is bottleneck used at JPL?!
@kwgoodman Thanks for the quick response! To be precise, I have been using it indirectly through xarray for multiple JPL projects.
I take it this is a well-known issue, then, and isn't going to be addressed anytime soon? I discovered this when calculating statistics for a dataset similar in size to the one in my example, and got some pretty serious errors. In that case the dataset had a min and max that were very close to each other, but the computed mean came out lower than the min, and consequently the standard deviation was over an order of magnitude larger than its actual value.
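In case it helps anyone hitting the same thing: a workaround sketch (plain numpy; the synthetic data here is hypothetical, chosen to mimic a narrow band of values far from zero) is to upcast to float64 once before reducing. The same cast applied before handing an array to bn.nanmean or to xarray should keep every downstream reduction in double precision:

```python
import numpy as np

# Hypothetical float32 field: values in a narrow band around 1e4,
# similar in shape to the dataset described above.
rng = np.random.default_rng(1)
data = (1e4 + rng.random(1_000_000)).astype('float32')

# Upcast once; all reductions then accumulate in double precision,
# so the mean stays inside [min, max] and the std is meaningful.
data64 = data.astype('float64')
mean = np.nanmean(data64)
std = np.nanstd(data64)
print(mean, std)
```

This costs a temporary copy at twice the memory, but avoids the float32 accumulation problem entirely.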
Happy to hear that bottleneck is used, even if indirectly, at JPL.
I'd be interested in trying pairwise summation in bn.nansum and bn.nanmean. I get paid for releases of numerox but have not found funding for bottleneck development.