xarray
Disable bottleneck by default?
What is your issue?
Our choice to enable bottleneck by default results in quite a few issues about numerical stability and funny dtype behaviour: #7336, #7128, #2370, #1346 (and probably more)
Shall we disable it by default?
I kinda think correctness by default is more important than performance, especially if the default performance isn't prohibitively slow.
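For context, users who hit these problems can already opt out per-session without waiting on a default change, via xarray's options mechanism (this assumes the `use_bottleneck` option available in recent xarray versions; exact version availability not confirmed here):

```python
import xarray as xr

# Disable bottleneck for this session; xarray falls back to its
# plain-NumPy code paths for reductions and rolling operations.
xr.set_options(use_bottleneck=False)

# Or scope it to a block so the rest of the program is unaffected:
with xr.set_options(use_bottleneck=False):
    da = xr.DataArray([1.0, 2.0, 3.0])
    total = da.sum().item()  # computed via the NumPy path
```

So the question here is only about the default, not about whether opting out is possible.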
I'd be fine with disabling, since bottleneck doesn't seem to be actively maintained.
Though I would say it's numerically unstable rather than incorrect! I would always want it enabled, but it does make sense to default to the conservative option.
I had dreams of making numbagg into a better bottleneck — it's just as fast, much more flexible, and integrates really well with xarray. But those dreams have not come to pass (yet!).
The case where bottleneck really makes a difference is moving window statistics, where it uses a smarter algorithm than our current NumPy implementation, which creates a moving window view.
Otherwise, I agree, it probably isn't worth the trouble.
That said -- we could also switch to smarter NumPy-based algorithms to implement most moving window calculations, e.g., using np.nancumsum for moving window means.
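A minimal sketch of the cumsum idea (illustrative only, not xarray's actual implementation; the function name and signature are made up for this example):

```python
import numpy as np

def moving_mean(a, window):
    # O(n) NaN-skipping moving mean via cumulative sums, instead of
    # reducing over an O(n * window) moving-window view.
    a = np.asarray(a, dtype=float)
    valid = ~np.isnan(a)
    # Pad with a leading zero so each window is a simple difference.
    csum = np.concatenate([[0.0], np.nancumsum(a)])
    ccount = np.concatenate([[0], np.cumsum(valid)])
    wsum = csum[window:] - csum[:-window]
    wcount = ccount[window:] - ccount[:-window]
    # All-NaN windows get NaN; np.maximum avoids division by zero.
    return np.where(wcount > 0, wsum / np.maximum(wcount, 1), np.nan)
```

Note that a cumulative-sum formulation has its own (different) floating-point accumulation behaviour, so it is not automatically a fix for the precision complaints -- it mainly addresses speed.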
I fully agree that correctness is the priority. Note however that some functions now require bottleneck, like ffill and bfill (I am not sure if there are more cases). They may need to be modified so they can run without bottleneck.
I think it's OK to still require bottleneck for ffill and bfill:
- There are no numerical concerns: these functions simply repeat numbers forward (or backwards).
- There is no good alternative to using a loop, and writing the loop in NumPy would be prohibitively slow.
Maybe it is just a problem of documenting it more clearly?
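For reference, the semantics are trivial -- ffill just propagates the last valid value forward, which maps naturally onto a loop (a toy 1-D sketch; bottleneck runs an equivalent loop in compiled code, which is exactly why the interpreted version is so much slower):

```python
import numpy as np

def ffill_1d(a):
    # Propagate the last non-NaN value forward; leading NaNs stay NaN.
    out = np.asarray(a, dtype=float).copy()
    last = np.nan
    for i in range(out.size):
        if np.isnan(out[i]):
            out[i] = last
        else:
            last = out[i]
    return out
```

There is no summation involved, so no precision concern -- values are only copied, never combined.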
I want to add a +1 to disable it by default. It's pretty common to be using float32 precision arrays. I have a rolling mean operation early on in some code and the errors balloon over time in subsequent processes. This was a super obscure bug to track down as well.
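To make the float32 failure mode concrete (an illustrative NumPy snippet, not bottleneck's actual code path): once a running-sum accumulator grows large, small increments are rounded away entirely, which is the kind of drift that surfaces in a long float32 rolling mean.

```python
import numpy as np

# float32 carries ~7 decimal digits of precision. At an accumulator
# value of 1e8, the spacing between adjacent float32 values is 8,
# so adding 1.0 is lost completely.
acc = np.float32(1e8)
assert acc + np.float32(1.0) == acc                 # increment vanishes
assert np.float64(1e8) + 1.0 != np.float64(1e8)     # float64 still resolves it
```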
> Though I would say it's numerically unstable rather than incorrect! I would always want it enabled, but it does make sense to default to the conservative option.
I recently stumbled across a case where bottleneck gives a completely incorrect answer or segfaults when taking the max of a small array (similar to: https://github.com/pydata/bottleneck/issues/381 - a fix was proposed but never merged).
> I think it's OK to still require bottleneck for ffill and bfill:
As of #8389 bottleneck is no longer used for ffill and bfill by default. Given the issues w/ precision and correctness, is there any remaining reason to not disable it by default? If any remaining functions require bottleneck, it would make sense to enable bottleneck by default for those functions only.
I would support disabling bottleneck by default for now, and eventually permanently removing support for bottleneck. Numbagg seems to be a more than capable and much more maintainable replacement.
As an update — numbagg is in a pretty good place now, and xarray defaults to using numbagg over bottleneck for overlapping functions when both are installed.
(I didn't finish my fuzzing work in numbagg, so I'm not as confident as I'd hope to be, but also no one has reported anything for a long time, so the "fuzzing from the world" has been successful...)
Numbagg doesn't cover all of bottleneck's functions — some aren't practical to write with numba, since they require specialized data structures, such as moving max.
...though I'm not sure if that updates us towards or away from disabling it by default — the substitute is better than it used to be, but also the substitute overrides many of its functions by default when installed...
> ...though I'm not sure if that updates us towards or away from disabling it by default — the substitute is better than it used to be, but also the substitute overrides many of its functions by default when installed...
If I don't have numbagg installed but happen to have bottleneck installed, I would prefer for the default behavior to be using np.nanmax instead of the incorrect bottleneck.nanmax from the issue I linked above.