
BUG: Inconsistent resolution from constructing with and assigning datetime values

Open zhu0210 opened this issue 2 years ago • 4 comments

Pandas version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of pandas.

  • [ ] I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

from datetime import datetime
import pandas as pd

construct = pd.DataFrame([datetime(year=2024, month=1, day=26, hour=0, minute=0, second=0, microsecond=0)])
print(construct.info())

assign = pd.DataFrame([], index=[0])
assign["time"] = datetime(year=2024, month=1, day=26, hour=0, minute=0, second=0, microsecond=0)
print(assign.info())

assign_timestamp = pd.DataFrame([], index=[0])
assign_timestamp["time"] = pd.Timestamp(year=2024, month=1, day=26, hour=0, minute=0, second=0, microsecond=0)
print(assign_timestamp.info())

assign_loc = pd.DataFrame([], index=[0])
assign_loc.loc[0, "time"] = datetime(year=2024, month=1, day=26, hour=0, minute=0, second=0, microsecond=0)
print(assign_loc.info())

Issue Description

Hi! I'm running into a problem with datetime columns. We hit it while upgrading our production environment from pandas 1.5.3 to 2.1.4. Before the upgrade, constructing and assigning both resulted in datetime64[ns]. As of 2.1.4, we get the following results:

print(construct.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   0       1 non-null      datetime64[ns]
dtypes: datetime64[ns](1)
memory usage: 140.0 bytes
None

print(assign.info())
<class 'pandas.core.frame.DataFrame'>
Index: 1 entries, 0 to 0
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   time    1 non-null      datetime64[us]
dtypes: datetime64[us](1)
memory usage: 16.0 bytes
None

print(assign_timestamp.info())
<class 'pandas.core.frame.DataFrame'>
Index: 1 entries, 0 to 0
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   time    1 non-null      datetime64[us]
dtypes: datetime64[us](1)
memory usage: 16.0 bytes
None

print(assign_loc.info())
<class 'pandas.core.frame.DataFrame'>
Index: 1 entries, 0 to 0
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   time    1 non-null      datetime64[ns]
dtypes: datetime64[ns](1)
memory usage: 124.0 bytes
None

While construction and .loc assignment give the dtype datetime64[ns], df["time"] = datetime(...) results in datetime64[us].

I tested in a different environment using pandas 2.2.0. The result is different when using .loc:

print(assign_loc.info())
<class 'pandas.core.frame.DataFrame'>
Index: 1 entries, 0 to 0
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   time    1 non-null      datetime64[us]
dtypes: datetime64[us](1)
memory usage: 124.0 bytes
None

This time assign_loc gives datetime64[us], but construct still gives datetime64[ns].

I've noticed in the release notes that changes were made to astype so that inputs are parsed according to their original resolution, but that doesn't intuitively explain the inconsistency here. This is causing massive concat errors in our code due to mismatched dtypes, and we are concerned because we are not sure whether this is a bug and which behavior will be stabilized. We would appreciate confirmation.
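For reference, one stopgap we could use until the behavior settles (not from the original report; a sketch assuming pandas >= 2.0, where `Series.astype` accepts a unit-qualified datetime64 dtype) is to normalize every datetime column to a single resolution before concatenating:

```python
from datetime import datetime
import pandas as pd

df = pd.DataFrame([], index=[0])
df["time"] = datetime(2024, 1, 26)  # may come out as datetime64[us] on pandas >= 2.1

# Explicitly cast to one resolution so concat sees consistent dtypes
df["time"] = df["time"].astype("datetime64[ns]")
print(df["time"].dtype)  # datetime64[ns]
```

Casting explicitly on every ingestion path sidesteps the inference differences between construction and assignment, at the cost of an extra astype call.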

Thank you so much and wish you all the best!

Expected Behavior

Consistent datetime resolution

Installed Versions

2.1.4

INSTALLED VERSIONS

commit : a671b5a8bf5dd13fb19f0e88edc679bc9e15c673 python : 3.11.7.final.0 python-bits : 64 OS : Windows OS-release : 10 Version : 10.0.22621 machine : AMD64 processor : Intel64 Family 6 Model 151 Stepping 2, GenuineIntel byteorder : little LC_ALL : None LANG : None LOCALE : en_US.UTF-8 pandas : 2.1.4 numpy : 1.26.3 pytz : 2023.3.post1 dateutil : 2.8.2 setuptools : 68.2.2 pip : 23.3.1 Cython : None pytest : 7.4.0 hypothesis : None sphinx : None blosc : None feather : None xlsxwriter : None lxml.etree : 4.9.3 html5lib : None pymysql : 1.0.2 psycopg2 : None jinja2 : 3.1.2 IPython : 8.20.0 pandas_datareader : None bs4 : 4.12.2 bottleneck : 1.3.5 dataframe-api-compat: None fastparquet : 2023.8.0 fsspec : 2023.10.0 gcsfs : None matplotlib : 3.8.0 numba : 0.58.1 numexpr : 2.8.7 odfpy : None openpyxl : 3.0.10 pandas_gbq : None pyarrow : 11.0.0 pyreadstat : None pyxlsb : None s3fs : None scipy : 1.11.4 sqlalchemy : 2.0.25 tables : None tabulate : None xarray : None xlrd : None zstandard : None tzdata : 2023.3 qtpy : None pyqt5 : None

TEST 2.2.0:

INSTALLED VERSIONS

commit : f538741432edf55c6b9fb5d0d496d2dd1d7c2457 python : 3.12.0.final.0 python-bits : 64 OS : Windows OS-release : 11 Version : 10.0.22621 machine : AMD64 processor : Intel64 Family 6 Model 151 Stepping 2, GenuineIntel byteorder : little LC_ALL : None LANG : None LOCALE : en_US.UTF-8 pandas : 2.2.0 numpy : 1.26.3 pytz : 2023.3.post1 dateutil : 2.8.2 setuptools : 68.2.2 pip : 23.3.2 Cython : None pytest : None hypothesis : None sphinx : None blosc : None feather : None xlsxwriter : None lxml.etree : None html5lib : None pymysql : None psycopg2 : None jinja2 : 3.1.2 IPython : 8.18.1 pandas_datareader : None adbc-driver-postgresql: None adbc-driver-sqlite : None bs4 : 4.12.2 bottleneck : None dataframe-api-compat : None fastparquet : None fsspec : None gcsfs : None matplotlib : 3.8.2 numba : None numexpr : None odfpy : None openpyxl : None pandas_gbq : None pyarrow : None pyreadstat : None python-calamine : None pyxlsb : None s3fs : None scipy : 1.11.4 sqlalchemy : None tables : None tabulate : None xarray : None xlrd : None zstandard : None tzdata : 2023.3 qtpy : None pyqt5 : None

zhu0210 avatar Jan 26 '24 03:01 zhu0210

Bisected this issue, regression happened on this commit:

a2bb939ebd3aa9d78f96a72fa575f67b076f0bfa is the first bad commit
commit a2bb939ebd3aa9d78f96a72fa575f67b076f0bfa
Author: jbrockmendel <[email protected]>
Date:   Thu May 18 09:11:48 2023 -0700

    API/BUG: infer_dtype_from_scalar with non-nano (#52212)
    
    * API/BUG: infer_dtype_from_scalar with non-nano
    * update test
    * xfail on 32bit
    * fix xfail condition
    * whatsnew
    * xfail on windows

File used for bisecting

# t.py file

from datetime import datetime
import pandas as pd

construct = pd.DataFrame([datetime(year=2024, month=1, day=26, hour=0, minute=0, second=0, microsecond=0)])
assert str(construct.iloc[:,0].dtype) == 'datetime64[ns]'

assign = pd.DataFrame([], index=[0])
assign["time"] = datetime(year=2024, month=1, day=26, hour=0, minute=0, second=0, microsecond=0)
assert str(assign.iloc[:,0].dtype) == 'datetime64[ns]'

assign_timestamp = pd.DataFrame([], index=[0])
assign_timestamp["time"] = pd.Timestamp(year=2024, month=1, day=26, hour=0, minute=0, second=0, microsecond=0)
assert str(assign_timestamp.iloc[:,0].dtype) == 'datetime64[ns]'

assign_loc = pd.DataFrame([], index=[0])
assign_loc.loc[0, "time"] = datetime(year=2024, month=1, day=26, hour=0, minute=0, second=0, microsecond=0)
assert str(assign_loc.iloc[:,0].dtype) == 'datetime64[ns]'

Before that commit, all dtypes are datetime64[ns]. After:

>>> print(pd.__version__)
2.1.0.dev0+797.ga2bb939ebd.dirty
>>> str(construct.iloc[:,0].dtype)
'datetime64[ns]'
>>> str(assign.iloc[:,0].dtype)
'datetime64[us]'
>>> str(assign_timestamp.iloc[:,0].dtype)
'datetime64[us]'
>>> str(assign_loc.iloc[:,0].dtype)
'datetime64[ns]'

aidoskanapyanov avatar Jan 30 '24 11:01 aidoskanapyanov

Also, I think the ns and us resolutions depend on whether we are assigning a scalar or a list. Referencing https://github.com/pandas-dev/pandas/issues/55014#issuecomment-1764341895
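A minimal sketch of that scalar-vs-list difference (an editorial illustration; the exact units printed depend on the pandas version, since scalar assignment goes through scalar dtype inference while list assignment goes through array conversion):

```python
from datetime import datetime
import pandas as pd

ts = datetime(2024, 1, 26)

# Scalar assignment: dtype inferred from the single scalar
scalar = pd.DataFrame([], index=[0])
scalar["time"] = ts

# List assignment: dtype inferred via array construction
listed = pd.DataFrame([], index=[0])
listed["time"] = [ts]

print(scalar["time"].dtype, listed["time"].dtype)
```

On the affected versions the two prints can disagree (e.g. us vs ns), which is the inconsistency discussed above.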

aidoskanapyanov avatar Jan 30 '24 11:01 aidoskanapyanov

cc @jbrockmendel

rhshadrach avatar Feb 11 '24 03:02 rhshadrach

Should be addressed by #55901

jbrockmendel avatar Feb 14 '24 18:02 jbrockmendel