.min() doesn't work on np.datetime64 with a chunked Dataset
Hi all,
If an xr.Dataset is chunked, I cannot call ds.time.min(); I get this error: ufunc 'add' cannot use operands with types dtype('<M8[ns]') and dtype('<M8[ns]'). Is this expected?
Oddly, ds2.time.mean() works.
Thanks
What happened:
A UFuncTypeError was raised: ufunc 'add' cannot use operands with types dtype('<M8[ns]') and dtype('<M8[ns]')
What you expected to happen:
Compute the min and max of a chunked datetime64 xarray.DataArray.
Minimal Complete Verifiable Example:
import xarray as xr
import numpy as np

obs = 200
t0 = np.datetime64("2010-01-01T00:00:00")
tn = t0 + np.timedelta64(123 * 4, "D")
ds2 = xr.Dataset(
    {
        "time": (["obs"], np.arange(t0, tn, (tn - t0) / obs)),
    },
    coords={
        "obs": (["obs"], np.arange(obs)),
    },
).chunk({"obs": 100})
ds2.time.min()
Anything else we need to know?:
ds2.time.mean() works, but .min() and .max() raise the exception above.
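As a possible workaround until this is fixed, and assuming the time coordinate fits in memory, the reduction can be done after loading the chunked variable, since plain NumPy handles min/max on datetime64 without trouble:

```python
import numpy as np
import xarray as xr

obs = 200
t0 = np.datetime64("2010-01-01T00:00:00")
tn = t0 + np.timedelta64(123 * 4, "D")
ds2 = xr.Dataset(
    {"time": (["obs"], np.arange(t0, tn, (tn - t0) / obs))},
    coords={"obs": (["obs"], np.arange(obs))},
).chunk({"obs": 100})

# Load the chunked time variable into memory first, then reduce;
# this sidesteps the dask code path that raises the UFuncTypeError.
tmin = ds2.time.compute().min()
```

This only helps for coordinates small enough to load eagerly, of course.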
Environment:
Output of xr.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.7.9 (default, Aug 31 2020, 12:42:55) [GCC 7.3.0]
python-bits: 64
OS: Linux
OS-release: 4.15.0-133-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: fr_FR.UTF-8
LOCALE: fr_FR.UTF-8
libhdf5: 1.12.0
libnetcdf: 4.7.4
xarray: 0.16.2
pandas: 1.2.1
numpy: 1.19.5
scipy: 1.6.0
netCDF4: 1.5.5.1
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: 2.6.1
cftime: 1.3.1
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2021.01.1
distributed: 2021.01.1
matplotlib: 3.3.4
cartopy: None
seaborn: None
numbagg: None
pint: 0.16.1
setuptools: 52.0.0.post20210125
pip: 20.3.3
conda: None
pytest: 6.2.2
IPython: 7.20.0
sphinx: 3.5.0
core.duck_array_ops.mean seems to have a custom wrapper for datetime arrays.
It should not be a problem to generalize this to min and max as well.
Maybe a more generic wrapper would be the best solution?
Yeah, that's a good idea. We should check whether dask and numpy support this now.
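For what it's worth, a quick check shows that plain NumPy already reduces datetime64 arrays fine; the failure in this issue comes from the dask code path, not from NumPy itself:

```python
import numpy as np

# NumPy handles min/max on datetime64 arrays directly.
times = np.arange(
    np.datetime64("2010-01-01"), np.datetime64("2010-01-11"), np.timedelta64(1, "D")
)
print(times.min(), times.max())  # 2010-01-01 2010-01-10
```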
Yes, adding a custom wrapper derived from core.duck_array_ops.mean sounds like a good approach. It would allow handling datetime arrays for operators other than .mean().
I ran into the same issue when applying .median() or .std() to an xarray.core.resample.DatasetResample object. In my case, I need to apply these operations to the dataset variables while still keeping meaningful values for the time coordinate, for example retaining a representative timestamp (such as the mean time) even when using .median() or .std() on the variables.
I experimented with building a wrapper similar to core.duck_array_ops.mean, and it works for my use case:
# median = _create_nan_agg_method("median", invariant_0d=True)
# median.numeric_only = True
_median = _create_nan_agg_method("median", invariant_0d=True)


def median(array, axis=None, skipna=None, **kwargs):
    """In-house median that can handle np.datetime64 or cftime.datetime
    dtypes."""
    from xarray.core.common import _contains_cftime_datetimes

    array = asarray(array)
    if dtypes.is_datetime_like(array.dtype):
        dmin = _datetime_nanreduce(array, min).astype("datetime64[Y]").astype(int)
        dmax = _datetime_nanreduce(array, max).astype("datetime64[Y]").astype(int)
        offset = (
            np.array((dmin + dmax) // 2).astype("datetime64[Y]").astype(array.dtype)
        )
        # From version 2025.01.2 xarray uses np.datetime64[unit], where unit
        # is one of "s", "ms", "us", "ns".
        # To not have to worry about the resolution, we just convert the output
        # to "timedelta64" (without unit) and let the dtype of offset take
        # precedence. This is fully backwards compatible with datetime64[ns].
        return (
            _median(
                datetime_to_numeric(array, offset), axis=axis, skipna=skipna, **kwargs
            ).astype("timedelta64")
            + offset
        )
    elif _contains_cftime_datetimes(array):
        offset = min(array)
        timedeltas = datetime_to_numeric(array, offset, datetime_unit="us")
        median_timedeltas = _median(timedeltas, axis=axis, skipna=skipna, **kwargs)
        return _to_pytimedelta(median_timedeltas, unit="us") + offset
    else:
        return _median(array, axis=axis, skipna=skipna, **kwargs)


median.numeric_only = True  # type: ignore[attr-defined]
I’m not a very advanced Python programmer, so I’m sure there are cleaner and more robust or more generic ways to solve this, but I hope this example illustrates the need for a general wrapper that adds datetime handling to other reduction operations on xarray.core.resample.DataArrayResample and xarray.core.groupby.DatasetGroupBy objects.
For min and max it's possible to add support within dask. They shouldn't require special handling on the xarray side.
I think this can be closed now that the dask PR is in and released.