cubed icon indicating copy to clipboard operation
cubed copied to clipboard

np.nanmean executes eagerly on cubed arrays but lazily on dask arrays

Open TomNicholas opened this issue 2 years ago • 4 comments

See image for demonstration.

Screenshot from 2023-03-14 19-26-13

np.nanmean is called by xarray's .mean() method when skipna=True, which is the default.

TomNicholas avatar Mar 14 '23 23:03 TomNicholas

Dask supports the __array_ufunc__ and __array_function__ protocols which is why it works lazily in this case. Cubed doesn't, and nanmean is not in the Array API, so np.nanmean will coerce the Cubed array to a NumPy array eagerly.

There's a bit of a discussion about the nanfunctions in https://github.com/data-apis/array-api/issues/170, including workarounds, but I'm not sure if that would help you @TomNicholas?

tomwhite avatar Mar 15 '23 12:03 tomwhite

nanmean is not in the Array API

Yeah I noticed that - what cubed does here is totally reasonable. I wasn't sure whether to raise this on cubed or xarray.

There's a bit of a discussion about the nanfunctions in https://github.com/data-apis/array-api/issues/170,

Thanks! This comment mentions using the where kwarg to mean, which I presume is something we could change xarray to do instead of calling nanmean (though I expect nanmean is probably more performant?). The where kwarg also isn't in the Array API standard either though by the looks of it. They say there is another discussion about nan-aware functions but I can't find it by searching that repo :confused:

TomNicholas avatar Mar 15 '23 15:03 TomNicholas

see https://github.com/pydata/xarray/pull/7424

dcherian avatar Mar 15 '23 15:03 dcherian

I had a look at how scikit-learn has implemented nan functions, e.g. nanmin:

https://github.com/scikit-learn/scikit-learn/blob/83a8015702213b39510a0f4898bc6879bcdf3ac2/sklearn/utils/_array_api.py#L489-L503

An approach like this would work for Xarray and Cubed, but it wouldn't be very performant as there are multiple rounds of reductions, which Cubed can't optimise well yet. (Note that #284 doesn't really cover the case of optimising one reduction after another).

So I think it would be simpler to add the nan functions to Cubed. I would suggest adding them to cubed.array, rather than cubed.array_api since they haven't been standardised yet. In fact, it might be worth deprecating cubed.array_api in favour of cubed.array, since it is fine for an array implementation to be a superset of the standard.

tomwhite avatar Sep 12 '23 10:09 tomwhite