Potential regression in Dataset.from_dataframe() not preserving timezone
What happened?
Converting a pandas DataFrame that has a timezone-aware datetime column to an xarray Dataset does not preserve the timezone. This only breaks in version 2024.5.0.
What did you expect to happen?
I would expect the timezone info to be preserved, as was the case before.
Minimal Complete Verifiable Example
import pandas as pd
import xarray as xr
df1 = pd.DataFrame(
{"A": pd.date_range("20130101", periods=4, tz="US/Eastern"), "B": [1, 2, 3, 4]}
)
dataset = xr.Dataset.from_dataframe(df1)
df2 = dataset.to_dataframe()
print(df1.dtypes, dataset.dtypes, df2.dtypes, sep="\n\n")
MVCE confirmation
- [X] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
- [X] Complete example — the example is self-contained, including all data and the text of any traceback.
- [X] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
- [X] New issue — a search of GitHub Issues suggests this is not a duplicate.
- [X] Recent environment — the issue occurs with the latest version of xarray and its dependencies.
Relevant log output
# On xarray 2024.5.0:
A datetime64[ns, US/Eastern]
B int64
dtype: object
Frozen({'A': dtype('<M8[ns]'), 'B': dtype('int64')})
A datetime64[ns]
B int64
dtype: object
# ---------------------------
# On previous versions:
A datetime64[ns, US/Eastern]
B int64
dtype: object
Frozen({'A': dtype('O'), 'B': dtype('int64')})
A datetime64[ns, US/Eastern]
B int64
dtype: object
Anything else we need to know?
No response
Environment
INSTALLED VERSIONS
commit: None
python: 3.12.1 (tags/v3.12.1:2305ca5, Dec 7 2023, 22:03:25) [MSC v.1937 64 bit (AMD64)]
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: AMD64 Family 23 Model 113 Stepping 0, AuthenticAMD
byteorder: little
LC_ALL: None
LANG: None
LOCALE: ('English_United States', '1252')
libhdf5: 1.14.2
libnetcdf: None

xarray: 2024.5.0
pandas: 2.2.2
numpy: 1.26.4
scipy: 1.12.0
netCDF4: None
pydap: None
h5netcdf: None
h5py: 3.10.0
zarr: None
cftime: None
nc_time_axis: None
iris: None
bottleneck: None
dask: None
distributed: None
matplotlib: 3.8.3
cartopy: None
seaborn: None
numbagg: None
fsspec: 2024.3.1
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 69.2.0
pip: 24.0
conda: None
pytest: None
mypy: None
IPython: 8.22.2
sphinx: 7.2.6
@ilan-gold are you able to take a look here please? I suspect it's related to extension array stuff
Is dtype('O') from previous versions correct though?
Ah, ok, I see: previously this was an array of Timestamp objects, and now it is being converted into a numpy array with a "proper" datatype
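To make the two representations concrete, here is a small pandas-only sketch (it does not touch xarray, and it is not a claim about what xarray does internally). It assumes the documented behaviour of `Series.to_numpy()` for timezone-aware data: the default gives an object array of `Timestamp` objects that keep their timezone, while requesting `datetime64[ns]` gives UTC values with the timezone dropped.

```python
import numpy as np
import pandas as pd

# Same kind of column as in the report: tz-aware datetimes in US/Eastern.
s = pd.Series(pd.date_range("20130101", periods=4, tz="US/Eastern"))

# Default conversion: object array of tz-aware pandas.Timestamp objects.
as_objects = s.to_numpy()

# Explicit datetime64 conversion: values are converted to UTC and the
# timezone information is lost.
as_datetime64 = s.to_numpy(dtype="datetime64[ns]")

print(as_objects.dtype, type(as_objects[0]).__name__, as_objects[0].tz)
print(as_datetime64.dtype, as_datetime64[0])
```

The first form matches the `dtype('O')` seen on previous versions, the second matches the `<M8[ns]` seen on 2024.5.0.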
It's possible the previous behaviour was unintentional and this one is more "correct"/consistent ... Some exploration and reporting would be very helpful.
Ok, so the problem is that the timezone-aware datetime dtype is an extension array dtype: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.api.types.is_extension_array_dtype.html
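A quick way to check this from the pandas side, as a rough sketch: the column names `aware` and `naive` are made up for illustration, and only `B` comes from the original example.

```python
import pandas as pd
from pandas.api.types import is_extension_array_dtype

df = pd.DataFrame(
    {
        "aware": pd.date_range("20130101", periods=4, tz="US/Eastern"),
        "naive": pd.date_range("20130101", periods=4),
        "B": [1, 2, 3, 4],
    }
)

# Only the tz-aware column is backed by a pandas extension dtype
# (DatetimeTZDtype); the naive datetime and int columns use numpy dtypes.
checks = {name: is_extension_array_dtype(col.dtype) for name, col in df.items()}
print(checks)
print(isinstance(df["aware"].dtype, pd.DatetimeTZDtype))
```

So only the timezone-aware column would go through any extension-array-specific code path; the naive datetime column would not.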
I will look into properly preserving the dtype then, although I suspect there is something else going on regarding datetimes (or the testing is not specific enough to cover this case)