bottleneck
Inconsistencies in nansum for float32 dtype compared to numpy
Consider this simple example:
In [1]: import numpy as np
In [2]: import bottleneck as bn
In [3]: data = 2e5*np.random.rand(int(4e7)).astype('float32')
In [4]: np.nansum(data)
Out[4]: 4000034300000.0
In [5]: bn.nansum(data)
Out[5]: 3719060258816.0
Looks like errors in the computation are compounding due to loss of precision, since the problem becomes much less apparent for smaller datasets. Repeating the above with the float64 dtype gives me much more consistent results:
In [6]: bn.nansum(data.astype('float64'))
Out[6]: 4000035580557.9033
In [7]: np.nansum(data.astype('float64'))
Out[7]: 4000035580557.979
I tested this example with bottleneck 1.1.0 and 1.2.1.
I think numpy uses a more robust algorithm, pairwise summation (https://en.wikipedia.org/wiki/Pairwise_summation); bottleneck doesn't.
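To illustrate the difference, here is a minimal sketch of pairwise summation in pure Python (the block size of 128 and the helper names are my own choices, not bottleneck or numpy internals), compared against a naive left-to-right loop on float32 data similar to the example above:

```python
import numpy as np

def naive_sum(a):
    """Left-to-right accumulation in the array's own dtype (float32 here).
    Rounding error grows roughly linearly with the number of elements."""
    total = a.dtype.type(0)
    for x in a:
        total += x
    return total

def pairwise_sum(a, block=128):
    """Recursive pairwise summation: split the array in half, sum each half,
    then add the two partial sums. Error grows only logarithmically."""
    n = len(a)
    if n <= block:
        # Small base case: a plain loop over a short run is accurate enough.
        return naive_sum(a)
    mid = n // 2
    return pairwise_sum(a[:mid], block) + pairwise_sum(a[mid:], block)

rng = np.random.default_rng(0)
data = (2e5 * rng.random(100_000)).astype('float32')
exact = data.astype('float64').sum()  # double-precision reference

naive_err = abs(float(naive_sum(data)) - exact)
pairwise_err = abs(float(pairwise_sum(data)) - exact)
print(naive_err, pairwise_err)
```

On data like this, the pairwise result lands far closer to the float64 reference than the naive loop, even though both accumulate in float32.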
Is bottleneck used at JPL?!
@kwgoodman Thanks for the quick response! To be precise, I have been using it indirectly through xarray for multiple JPL projects.
I take it this is a well-known issue, then, and isn't going to be addressed anytime soon? I discovered this when calculating statistics for a dataset similar in size to the one in my example, and got some pretty serious errors. In that case the dataset had a min and max that were very close to each other, but the computed mean came out lower than the min, and consequently the standard deviation was over an order of magnitude larger than its actual value.
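In case it helps anyone hitting the same thing: a workaround sketch (plain numpy; the synthetic data here is hypothetical, chosen to mimic a narrow band of values far from zero) is to upcast to float64 once before reducing. The same cast applied before handing an array to bn.nanmean or to xarray should keep every downstream reduction in double precision:

```python
import numpy as np

# Hypothetical float32 field: values in a narrow band around 1e4,
# similar in shape to the dataset described above.
rng = np.random.default_rng(1)
data = (1e4 + rng.random(1_000_000)).astype('float32')

# Upcast once; all reductions then accumulate in double precision,
# so the mean stays inside [min, max] and the std is meaningful.
data64 = data.astype('float64')
mean = np.nanmean(data64)
std = np.nanstd(data64)
print(mean, std)
```

This costs a temporary copy at twice the memory, but avoids the float32 accumulation problem entirely.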
Happy to hear that bottleneck is used, even if indirectly, at JPL.
I'd be interested in trying pairwise summation in bn.nansum and bn.nanmean. I get paid for releases of numerox but have not found funding for bottleneck development.