xarray icon indicating copy to clipboard operation
xarray copied to clipboard

Disable bottleneck by default?

Open dcherian opened this issue 2 years ago • 11 comments
trafficstars

What is your issue?

Our choice to enable bottleneck by default results in quite a few issues about numerical stability and funny dtype behaviour: #7336, #7128, #2370, #1346 (and probably more)

Shall we disable it by default?

dcherian avatar Dec 01 '22 17:12 dcherian

I kinda think correctness by default is more important than performance, especially if the default performance isn't prohibitively slow.

TomNicholas avatar Dec 01 '22 19:12 TomNicholas

I'd be fine with disabling, since bottleneck doesn't seem to be actively maintained.

Though I would say it's numerically unstable rather than incorrect! I would always want it enabled, but it does make sense to default to the conservative option.

I had dreams of making numbagg into a better bottleneck — it's just as fast, much more flexible, and integrates really well with xarray. But those dreams have not come to pass (yet!).

max-sixty avatar Dec 01 '22 22:12 max-sixty

The case where Bottleneck really makes a difference was for moving window statistics, where it uses a smarter algorithm than our current NumPy implementation, which creating a moving window view.

Otherwise, I agree, it probably isn't worth the trouble.

That said -- we could also switch to smarter NumPy based algorithms to implement most moving window calculations, e.g,. using np.nancumsum for moving window means.

shoyer avatar Dec 04 '22 01:12 shoyer

I fully agree that correctness is the priority. Note however that some functions now require bottleneck, like ffill and bfill (I am not sure if there are more cases). They may need to be modified so they can run without bottleneck.

markelg avatar Dec 14 '22 13:12 markelg

I think it's OK to still require bottleneck for ffill and bfill:

  1. There are no numerical concerns: these functions simply repeat numbers forward (or backwards).
  2. There is no good alternative to using a loop, and writing the loop in NumPy would be probitively slow.

shoyer avatar Dec 14 '22 18:12 shoyer

Maybe it is just a problem of documenting it more clearly?

headtr1ck avatar Mar 10 '23 20:03 headtr1ck

I want to add a +1 to disable it by default. It's pretty common to be using float32 precision arrays. I have a rolling mean operation early on in some code and the errors balloon over time in subsequent processes. This was a super obscure bug to track down as well.

riley-brady avatar May 24 '23 23:05 riley-brady

Though I would say it's numerically unstable rather than incorrect! I would always want it enabled, but it does make sense to default to the conservative option.

I recently stumbled across a case where bottleneck gives a completely incorrect answer or segfaults when taking the max of a small array (similar to: https://github.com/pydata/bottleneck/issues/381 - a fix was proposed but never merged).

I think it's OK to still require bottleneck for ffill and bfill:

As of #8389 bottleneck is no longer used for ffill and bfill by default. Given the issues w/ precision and correctness, is there any remaining reason to not disable it by default? If any remaining functions require bottleneck, it would make sense to enable bottleneck by default for those functions only.

bqi343 avatar Mar 26 '24 20:03 bqi343

I would support disabling bottleneck by default for now, and eventually permanently removing support for bottleneck. Numbagg seems to be a more than capable and much more maintainable replacement.

shoyer avatar Mar 26 '24 22:03 shoyer

As an update — numbagg is in a pretty good place now, and xarray defaults to using numbagg over bottleneck for overlapping functions when both are installed.

(I didn't finish my fuzzing work in numbagg, so I'm not as confident as I'd hope to be, but also no one has reported anything for a long time, so the "fuzzing from the world" has been successful...)

Numbagg doesn't do all of bottlenecks functions — there are functions that aren't practical to write with numba, since they require specialized data structures, such as moving max.

...though I'm not sure if that updates us towards or away from disabling it by default — the substitute is better than it used to be, but also the substitute overrides many of its functions by default when installed...

max-sixty avatar Mar 26 '24 22:03 max-sixty

...though I'm not sure if that updates us towards or away from disabling it by default — the substitute is better than it used to be, but also the substitute overrides many of its functions by default when installed...

If I don't have numbagg installed but happen to have bottleneck installed, I would prefer for the default behavior to be using np.nanmax instead of the incorrect bottleneck.nanmax from the issue I linked above.

bqi343 avatar Mar 27 '24 00:03 bqi343