
`.sel()` fails on `datetime64[s]` object

Open oloapinivad opened this issue 1 month ago • 5 comments

What happened?

Hi there,

Sorry if this is a duplicate, but I have been browsing the repo without finding anything specific that resembles this.

I am exploring the possibility of using xarray with CFDatetimeCoder on a time period that overshoots the pandas nanosecond threshold (year 2262). I am running xarray=2025.9.0:

import xarray as xr
files="/ec/res4/scratch/ecme3497/ece4/pic2/output/oifs/pic2_atm_cmip6_1m_24*.nc"
coder = xr.coders.CFDatetimeCoder(time_unit='s')
data = xr.open_mfdataset(files, decode_times=coder)
data.time_counter

which gives me

array(['2400-01-16T12:00:00', '2400-02-15T12:00:00', '2400-03-16T12:00:00',
       ..., '2489-10-16T12:00:00', '2489-11-16T00:00:00',
       '2489-12-16T12:00:00'], shape=(1080,), dtype='datetime64[s]')

So far so good. However:

data.sel(time_counter=slice("2400-01-01", "2420-01-01"))
OverflowError                             Traceback (most recent call last)
File pandas/_libs/tslibs/period.pyx:1169, in pandas._libs.tslibs.period.period_ordinal_to_dt64()

OverflowError: Overflow occurred in npy_datetimestruct_to_datetime

The above exception was the direct cause of the following exception:

OutOfBoundsDatetime                       Traceback (most recent call last)
Cell In[8], line 1
----> 1 data.sel(time_counter=slice("2400-01-01", "2420-01-01"))

File /ECMWF_kQYjfeo/miniforge/envs/env1/lib/python3.12/site-packages/xarray/core/dataset.py:2974, in Dataset.sel(self, indexers, method, tolerance, drop, **indexers_kwargs)
   2906 """Returns a new dataset with each array indexed by tick labels
   2907 along the specified dimension(s).
   2908 
   (...)   2971 
   2972 """
   2973 indexers = either_dict_or_kwargs(indexers, indexers_kwargs, "sel")
-> 2974 query_results = map_index_queries(
   2975     self, indexers=indexers, method=method, tolerance=tolerance
   2976 )
   2978 if drop:
   2979     no_scalar_variables = {}
...
File pandas/_libs/tslibs/period.pyx:1992, in pandas._libs.tslibs.period._Period.to_timestamp()

File pandas/_libs/tslibs/period.pyx:1172, in pandas._libs.tslibs.period.period_ordinal_to_dt64()

OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 2400-01-01 00:00:00

I know this can be fixed using cftime, but I thought that using datetime64[s] would have fixed this issue. In general, I am quite confused about which xarray features are currently supported for going beyond the nanosecond limitation of pandas. Shall we keep working with cftime? Is there something I am missing? To what extent can we use the coder if this is the output? Any help, including suggestions for the documentation, is greatly appreciated! Thanks a lot.

EDIT: I can upload some of the data if required, but below you can find a reproducible example.

What did you expect to happen?

A correct time selection.

Minimal Complete Verifiable Example

import numpy as np
import xarray as xr

# month starts from 2389-01 to 2489-12 (inclusive)
months = np.arange("2389-01", "2490-01", dtype="datetime64[M]")  # dtype M = month starts
times = months.astype("datetime64[s]")  # convert to seconds (YYYY-MM-01T00:00:00)
# create an example DataArray
da = xr.DataArray(np.zeros(times.size), coords={"time": times}, dims=["time"])
da.sel(time=slice("2400-01-01", "2420-01-01"))

I have also been able to reproduce this with the most recent version of xarray:

>>> xr.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.14.0 | packaged by conda-forge | (main, Dec  2 2025, 20:23:19) [Clang 20.1.8 ]
python-bits: 64
OS: Darwin
OS-release: 23.6.0
machine: arm64
processor: arm
byteorder: little
LC_ALL: None
LANG: it_IT.UTF-8
LOCALE: ('it_IT', 'UTF-8')
libhdf5: None
libnetcdf: None

xarray: 2025.11.0
pandas: 2.3.3
numpy: 2.3.5
scipy: None
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
zarr: None
cftime: None
nc_time_axis: None
iris: None
bottleneck: None
dask: None
distributed: None
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
fsspec: None
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: None
pip: 25.3
conda: None
pytest: None
mypy: None
IPython: None
sphinx: None

MVCE confirmation

  • [x] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • [ ] Complete example — the example is self-contained, including all data and the text of any traceback.
  • [ ] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • [x] New issue — a search of GitHub Issues suggests this is not a duplicate.
  • [x] Recent environment — the issue occurs with the latest version of xarray and its dependencies.

Environment

xarray>=2025.9.0

oloapinivad avatar Dec 03 '25 09:12 oloapinivad

I can reproduce this with just

import pandas as pd

dates = pd.date_range("2400-01-15T12:00:00", freq="ME", periods=10, unit="s")
dates.slice_locs("2400-01-01", "2400-06-01")

which means we inherit this behavior from pandas.DatetimeIndex. The good news is that you can work around this by wrapping your timestamps in pd.Timestamp(t, unit="s").
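A minimal sketch of that workaround at the pandas level, assuming pandas 2.x, where a DatetimeIndex built from `datetime64[s]` values keeps second resolution. The index is built the same way as in the MVCE above; the names `index` and `selected` are illustrative only.

```python
import numpy as np
import pandas as pd

# Second-resolution monthly index for the year 2400 (month starts).
months = np.arange("2400-01", "2401-01", dtype="datetime64[M]")
index = pd.DatetimeIndex(months.astype("datetime64[s]"))

# Passing pd.Timestamp bounds instead of strings avoids the string-parsing
# path that raised OutOfBoundsDatetime in the traceback above.
start, stop = index.slice_locs(pd.Timestamp("2400-01-01"), pd.Timestamp("2400-06-01"))
selected = index[start:stop]
print(selected[0], selected[-1])
```

Both bounds are matched exactly here because they coincide with index values; the caveat about string versus Timestamp bounds still applies when they do not.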

cc @spencerkclark

keewis avatar Dec 03 '25 10:12 keewis

Ah! that's curious! Actually this works too, no need to specify unit="s":

import numpy as np
import xarray as xr
import pandas as pd

months = np.arange("2389-01", "2490-01", dtype="datetime64[M]")  # dtype M = month starts
times = months.astype("datetime64[s]")  # convert to seconds (YYYY-MM-01T00:00:00)
da = xr.DataArray(np.zeros(times.size), coords={"time": times}, dims=["time"])
da.sel(time=slice(pd.Timestamp("2400-01-01"), pd.Timestamp("2420-01-01"))).time

oloapinivad avatar Dec 03 '25 10:12 oloapinivad

Correct, this is an existing issue in pandas: https://github.com/pandas-dev/pandas/issues/56940. You could consider pinging it to get a sense for where it is on their roadmap.

It may not be a concern for your use-case, but I would offer a slight word of caution regarding using the pd.Timestamp workaround, since it has a different meaning than indexing with strings (it is not an exact drop-in replacement). With strings, pandas will implicitly expand the bounds to encompass the largest possible range of strings, while with pd.Timestamp objects, pandas will use the literal values as bounds. See this section of the pandas documentation for more details.
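The difference can be seen on a small index that is well inside the nanosecond range, so the overflow is not involved. This is only an illustrative sketch; the index values and variable names are made up for the example.

```python
import pandas as pd

# Four timestamps, twelve hours apart, spanning two days.
index = pd.DatetimeIndex(
    ["2000-01-01 00:00", "2000-01-01 12:00", "2000-01-02 00:00", "2000-01-02 12:00"]
)
series = pd.Series(range(4), index=index)

# String bounds have day resolution, so the end bound covers all of 2000-01-02.
by_string = series.loc["2000-01-01":"2000-01-02"]

# Timestamp bounds are literal: the slice stops at 2000-01-02 00:00:00.
by_timestamp = series.loc[pd.Timestamp("2000-01-01"):pd.Timestamp("2000-01-02")]

print(len(by_string), len(by_timestamp))
```

The string slice keeps the 12:00 entry on the last day, while the Timestamp slice drops it, so code that switches to the workaround may need an adjusted end bound.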

To the larger question of full non-nanosecond np.datetime64 support, indeed this is still a work in progress in pandas. There is a general tracking issue here: https://github.com/pandas-dev/pandas/issues/46587. A large number of the issues are resolved, though I am not sure how comprehensive or up-to-date it is (e.g., it does not seem to include this issue). In the bigger picture, cftime support is not being deprecated. Independent of the time range issue, there is still a need for non-standard calendar support, so if that supports the features you require, it is safe to continue using that.
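For context, the nanosecond bounds and a second-resolution value beyond them can be inspected directly. This is a small sketch assuming pandas 2.x, where `pd.Timestamp` exposes a `unit` attribute and accepts non-nanosecond `np.datetime64` scalars.

```python
import numpy as np
import pandas as pd

# Bounds of the default nanosecond-resolution Timestamp.
print(pd.Timestamp.min, pd.Timestamp.max)

# A second-resolution NumPy scalar beyond the nanosecond upper bound.
late = np.datetime64("2400-01-01T00:00:00", "s")
print(late, late.dtype)

# Wrapping it in pd.Timestamp keeps a non-nanosecond unit instead of overflowing.
stamp = pd.Timestamp(late)
print(stamp, stamp.unit)
```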

spencerkclark avatar Dec 03 '25 11:12 spencerkclark

> Correct, this is an existing issue in pandas: pandas-dev/pandas#56940. You could consider pinging it to get a sense for where it is on their roadmap.

Thanks, will do!

> It may not be a concern for your use-case, but I would offer a slight word of caution regarding using the pd.Timestamp workaround, since it has a different meaning than indexing with strings (it is not an exact drop-in replacement). With strings, pandas will implicitly expand the bounds to encompass the largest possible range of strings, while with pd.Timestamp objects, pandas will use the literal values as bounds. See this section of the pandas documentation for more details.

So this is why xarray does not force a conversion to pd.Timestamp on its side, and why this error is raised. At first I thought this was due to some missing parsing on your side, but now I see.

> To the larger question of full non-nanosecond np.datetime64 support, indeed this is still a work in progress in pandas. There is a general tracking issue here: pandas-dev/pandas#46587. A large number of the issues are resolved, though I am not sure how comprehensive or up-to-date it is (e.g., it does not seem to include this issue). In the bigger picture, cftime support is not being deprecated. Independent of the time range issue, there is still a need for non-standard calendar support, so if that supports the features you require, it is safe to continue using that.

In our case (https://github.com/DestinE-Climate-DT/AQUA) we rely on pandas for time operations, so cftime is not our first choice, but thanks for the comprehensive explanation.

Shall I close the issue, or do you want to leave it open as backlog?

oloapinivad avatar Dec 04 '25 07:12 oloapinivad

Thanks, I see, that makes sense. You are welcome to keep this issue open, since I suspect others may run into this same problem and may wonder about the status / underlying cause.

In doing a little more investigation on the pandas side, I think addressing https://github.com/pandas-dev/pandas/issues/28104, which is on the general checklist, might help solve this one as well (though I think there would need to be a way to propagate the resolution for determining the appropriate end bound).

spencerkclark avatar Dec 04 '25 12:12 spencerkclark