Interoperability with Pandas 2.0 non-nanosecond datetime
Is your feature request related to a problem?
As mentioned in this post on the Pangeo discourse, Pandas 2.0 will fully support non-nanosecond datetimes as indices. The motivation for this work came from the paleogeosciences, a community that needs to represent time in millions of years. One of the biggest motivators is also to facilitate paleodata-model comparison. Enter xarray!
Below is a snippet of code to create a Pandas Series with a non-nanosecond datetime index and export it to xarray (this works). However, most of the interesting functionality of xarray doesn't seem to support this datetime out of the box:
import numpy as np
import pandas as pd
import xarray as xr

pds = pd.Series(
    [10, 12, 11, 9],
    index=np.array(['-2000-01-01', '-2005-01-01', '-2008-01-01', '-2009-01-01']).astype('M8[s]'),
)
xra = pds.to_xarray()
xra.plot()  # matplotlib error
xra.sel(index='-2009-01-01', method='nearest')
To test, you will need the Pandas nightly build:
pip uninstall pandas -y
pip install --pre --extra-index-url https://pypi.anaconda.org/scipy-wheels-nightly/simple "pandas>1.9"
Describe the solution you'd like
Work towards an integration of the new datetimes with xarray, which will support users beyond the paleoclimate community.
Describe alternatives you've considered
No response
Additional context
No response
Hi @khider , thanks for raising this.
For those of us who haven't tried to use non-nanosecond datetimes before (e.g. me), could you possibly expand a bit more on
However, most of the interesting functionalities of xarray don't seem to support this datetime out-of-box:
specifically, where are errors being thrown from within xarray? And what functions are you referring to as examples?
we are casting everything back to datetime64[ns] when creating xarray objects, for example, so the only way to even get a non-nanosecond datetime variable is (or was, we might have fixed that?) through the zarr backend (though that would / might fail elsewhere).
@spencerkclark knows much more about this, but in any case we're aware of the change and are working on it (see e.g. #7441). (To be fair, though, at the moment it is mostly Spencer who's working on it, and he seems to be pretty preoccupied.)
Thanks for posting this general issue @khider. This is something that has been on my radar for several months and I'm on board with it being great to support (eventually this will likely help cftime support as well).
I might hesitate to say that I'm actively working on it yet 😬. Right now, in the time I have available, I'm mostly trying to make sure that xarray's existing functionality does not break under pandas 2.0. Once things are a little more stable in pandas with regard to this new feature my plan is to take a deeper dive into what it will take to adopt in xarray (some aspects might need to be handled delicately). We can plan on using this issue for more discussion.
As @keewis notes, xarray currently will cast any non-nanosecond precision datetime64 or timedelta64 values that are introduced to nanosecond-precision versions. This casting machinery goes through pandas, however, and I haven't looked carefully into how it behaves/is expected to behave under pandas 2.0. @khider, based on your nice example it seems that it is possible for non-nanosecond-precision values to slip through, which is something we may need to think about addressing for the time being.
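For reference, a quick way to see whether the pandas constructor preserves or casts a coarser-precision dtype (the behavior differs between pandas 1.x and 2.0) is to inspect the resulting Series dtype:

```python
import numpy as np
import pandas as pd

vals = np.array(["2000-01-01"], dtype="datetime64[s]")
s = pd.Series(vals)

# pandas < 2.0 always casts datetime input to nanosecond precision;
# pandas 2.0 may preserve the second-precision dtype instead.
print(s.dtype)
```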
Hi all,
Thank you for looking into this. I was very excited when the array was created from my non-nanosecond datetime index, but I couldn't do much manipulation beyond creation.
Indeed it would be nice if this "just worked" but it may take some time to sort out (sorry that this example initially got your hopes up!). Here what I mean by "address" is continuing to prevent non-nanosecond-precision datetime values from entering xarray through casting to nanosecond precision and raising an informative error if that is not possible. This of course would be temporary until we work through the kinks of enabling such support. In the big picture it is exciting that pandas is doing this in part due to your grant.
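A minimal sketch of such a guard, assuming a round-trip check to detect NumPy's silent overflow (the helper name `ensure_nanosecond` is hypothetical, not xarray API):

```python
import numpy as np

def ensure_nanosecond(values: np.ndarray) -> np.ndarray:
    # Hypothetical helper, not part of xarray's API.
    cast = values.astype("datetime64[ns]")
    # NumPy wraps silently on out-of-range datetime casts, so round-trip
    # back to the original unit to detect overflow.
    if not np.array_equal(cast.astype(values.dtype), values):
        raise ValueError(
            "datetime values cannot be represented at nanosecond precision"
        )
    return cast

# In-range values cast cleanly:
ensure_nanosecond(np.array(["2000-01-01"], dtype="datetime64[s]"))
```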
@khider It would be helpful if either you or someone on your team tried to make it work and opened a PR. That would give us a sense of what's needed and might speed it along. It would be an advanced change, but we'd be happy to provide feedback.
Adding expected-fail tests would be particularly helpful!
@dcherian +1. I'm happy to engage with others if they are motivated to start on this earlier.
I might need some help with the xarray codebase. I use it quite often but never had to dig into its guts.
@khider we are more than happy to help with digging into the codebase! A reasonable place to start would be just trying the operation you want to perform, and looking through the code for the functions any errors get thrown from.
You are also welcome to join our bi-weekly community meetings (there is one tomorrow morning!) or the office hours we run.
I can block out time to join today's meeting or an upcoming one if it would be helpful.
I can attend it too. 8:30am PST, correct?
Great -- I'll plan on joining. That's correct. It is at 8:30 AM PT (https://github.com/pydata/xarray/issues/4001).
Thanks for joining the meeting today @khider. Some potentially relevant places in the code that come to my mind are:
- Automatic casting to nanosecond precision
- Decoding times via pandas
- Encoding times via pandas
- datetime_to_numeric
Though as @shoyer says, searching for datetime64[ns] or timedelta64[ns] will probably go a long way toward finding most of these issues.
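One way to do that search, sketched as a grep over a local checkout of the repository (the path is an assumption):

```shell
# From the root of an xarray checkout: list hard-coded nanosecond dtypes.
grep -rn --include="*.py" -e "datetime64\[ns\]" -e "timedelta64\[ns\]" xarray/
```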
Some design questions that come to my mind are (but you don't need an answer to these immediately to start working):
- How do we decide which precision to decode times to? Would it be the finest precision that enables decoding without overflow?
- This is admittedly in the weeds, but how do we decide when to use cftime and when not to? It seems obvious that in the long term we should use NumPy values for proleptic Gregorian dates of all precisions, but what about dates from the Gregorian calendar (where we may no longer have the luxury that the proleptic Gregorian and Gregorian calendars are equivalent for all representable times)?
- Not a blocker (since this is an existing issue) but are there ways we could make working with mixed precision datetime values friendlier with regard to overflow (https://github.com/numpy/numpy/issues/16352)? I worry about examples like this:
>>> np.seterr(over="raise")
>>> np.datetime64("1970-01-01", "ns") - np.datetime64("0001-01-01", "D")
numpy.timedelta64(6795364578871345152,'ns')
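For illustration, the same subtraction stays safe if both operands are first brought to a precision coarse enough to hold the result, e.g. seconds (a sketch of a user-side workaround, not a general fix):

```python
import numpy as np

a = np.datetime64("1970-01-01", "ns")
b = np.datetime64("0001-01-01", "D")

# Subtracting directly promotes both operands to nanoseconds and
# silently overflows int64. Casting to seconds first avoids that:
delta = a.astype("datetime64[s]") - b.astype("datetime64[s]")
print(delta)  # roughly 1969 years, expressed in seconds
```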
Thank you!
The second point that you raise is what we are concerned about right now as well. So maybe it would be good to try to resolve it. How do you deal with PMIP simulations in terms of calendar?
Currently in xarray we make the choice based on the calendar attribute associated with the data on disk (following the CF conventions). If the data has a non-standard calendar (or cannot be represented with nanosecond-precision datetime values) then we use cftime; otherwise we use NumPy. Which kind of calendar do PMIP simulations typically use?
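For context on the "cannot be represented with nanosecond-precision datetime values" branch: at nanosecond precision an int64 only spans roughly 584 years, and pandas exposes the exact window directly:

```python
import pandas as pd

# The representable window at nanosecond precision runs from roughly
# 1677-09-21 to 2262-04-11, far too narrow for paleoclimate timescales.
print(pd.Timestamp.min, pd.Timestamp.max)
```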
For some background -- my initial need in this realm came mainly from idealized climate model simulations (e.g. configured to start on 0001-01-01 with a no-leap calendar), so I do not have a ton of experience with paleoclimate research. I would be happy to learn more about your application, however!
Hi all, I just ran into a really nasty-to-track-down bug in xarray (version 2023.08.0, apologies if this is fixed since) where non-nanosecond datetimes are creeping in via expand_dims. Look at the difference between expand_dims and assign_coords:
In [33]: xarray.Dataset().expand_dims({'foo': [np.datetime64('2018-01-01')]})
Out[33]:
<xarray.Dataset>
Dimensions: (foo: 1)
Coordinates:
* foo (foo) datetime64[s] 2018-01-01
Data variables:
*empty*
In [34]: xarray.Dataset().assign_coords({'foo': [np.datetime64('2018-01-01')]})
third_party/py/xarray/core/utils.py:1211: UserWarning: Converting non-nanosecond precision datetime values to nanosecond precision. This behavior can eventually be relaxed in xarray, as it is an artifact from pandas which is now beginning to support non-nanosecond precision values. This warning is caused by passing non-nanosecond np.datetime64 or np.timedelta64 values to the DataArray or Variable constructor; it can be silenced by converting the values to nanosecond precision ahead of time.
third_party/py/xarray/core/utils.py:1211: UserWarning: Converting non-nanosecond precision datetime values to nanosecond precision. This behavior can eventually be relaxed in xarray, as it is an artifact from pandas which is now beginning to support non-nanosecond precision values. This warning is caused by passing non-nanosecond np.datetime64 or np.timedelta64 values to the DataArray or Variable constructor; it can be silenced by converting the values to nanosecond precision ahead of time.
Out[34]:
<xarray.Dataset>
Dimensions: (foo: 1)
Coordinates:
* foo (foo) datetime64[ns] 2018-01-01
Data variables:
*empty*
It seems for the time being xarray depends on datetime64[ns] being used everywhere for correct behaviour -- I've seen some very weird data corruption happen silently when the wrong datetime64 types are used accidentally due to this bug. So it's best to be consistent about always enforcing datetime64[ns] for as long as this is the case.
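Until that's resolved, a defensive workaround on the user side is to normalize datetime inputs to nanosecond precision before handing them to xarray, which mirrors the suggestion in the warning message above:

```python
import numpy as np

# Normalize to nanosecond precision ahead of time so that xarray
# never sees a non-nanosecond datetime64 array.
vals = np.array(["2018-01-01"], dtype="datetime64[s]")
vals_ns = vals.astype("datetime64[ns]")
print(vals_ns.dtype)
```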
Agreed, many thanks for the report @mjwillson—we'll have to track down why this slips through in the case of expand_dims.
@mjwillson I think I tracked down the cause of the expand_dims issue—see #8782 for a fix.