BUG: In 1.5rc0 casting Series to datetime64 with specific but non-nanosecond units has no effect
Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [X] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd
from pandas.testing import assert_series_equal
print(f"{pd.__version__=}")
ser = pd.Series(["2000-02-02", "2000-03-03"], dtype="datetime64[ns]")
# Passes on pandas 1.4.4
# Fails on pandas 1.5.0rc0
# Fails on pandas 1.6.0.dev0+136.gbbf17ea692
assert_series_equal(
    ser.astype("datetime64[Y]"),
    pd.Series(["2000-01-01", "2000-01-01"], dtype="datetime64[ns]"),
)
Issue Description
In pandas 1.4.4 and earlier, it was possible to affect the contents of datetime64[ns] type Series by casting to datetime64 with other non-nanosecond units, even though the resulting Series still had datetime64[ns] as its type. As of 1.5rc0 this behavior seems to have changed. Casting to datetime64 with other units no longer seems to have any effect.
Based on comments in PR #48555 (closing issue #47844) that refer to the numpy unit conversion, it seems like this might not be the intended behavior. It is also a breaking change: we were relying on this behavior to turn month-start dates into the corresponding year-start dates. Snippet from that PR:
elif (
    self.tz is None
    and is_datetime64_dtype(dtype)
    and dtype != self.dtype
    and is_unitless(dtype)
):
    # TODO(2.0): just fall through to dtl.DatetimeLikeArrayMixin.astype
    warnings.warn(
        "Passing unit-less datetime64 dtype to .astype is deprecated "
        "and will raise in a future version. Pass 'datetime64[ns]' instead",
        FutureWarning,
        stacklevel=find_stack_level(inspect.currentframe()),
    )
    # unit conversion e.g. datetime64[s]
    return self._ndarray.astype(dtype)
Expected Behavior
I expected the dates in the series to be adjusted to be consistent with the frequency of the datetime64 type used in astype(), as illustrated in the example above.
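For reference, the truncation described above is what NumPy itself does when a `datetime64` array is cast to a coarser unit. The following is a minimal sketch at the NumPy level only (the variable names are illustrative and not taken from the report); it does not go through `Series.astype` at all.

```python
import numpy as np

# Two dates stored at nanosecond resolution, as in the reproducible example.
arr = np.array(["2000-02-02", "2000-03-03"], dtype="datetime64[ns]")

# Casting to a coarser unit truncates each value to the start of that unit.
as_years = arr.astype("datetime64[Y]")
as_months = arr.astype("datetime64[M]")

# Casting back to nanoseconds yields year-start / month-start timestamps.
year_start = as_years.astype("datetime64[ns]")
month_start = as_months.astype("datetime64[ns]")

print(year_start)
print(month_start)
```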
Installed Versions
INSTALLED VERSIONS
commit : bbf17ea692e437cec908eae6759ffff8092fb42e
python : 3.10.6.final.0
python-bits : 64
OS : Linux
OS-release : 5.15.0-47-generic
Version : #51-Ubuntu SMP Thu Aug 11 07:51:15 UTC 2022
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.6.0.dev0+136.gbbf17ea692
numpy : 1.23.3
pytz : 2022.2.1
dateutil : 2.8.2
setuptools : 65.3.0
pip : 22.2.2
Cython : 0.29.32
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None
tzdata : None
cc @jbrockmendel thoughts here?
Can fix for 1.5.1. For 2.0 we'll support [s, ms, us, ns], and I think astype to anything else will raise. Long-term, the user should use .floor, I think.
Messing around with this a bit, Series.dt.floor() works well for Day, Hour, & Minute frequencies, but it doesn't allow snapping to the start of years or months, so now I'm pulling out the numpy array, casting it to another frequency, and constructing a new Series. It seems like there must be a better way I'm not aware of.
import numpy as np
import pandas as pd
df = (
    pd.DataFrame()
    .assign(
        hourly=pd.Series(np.arange("2000-01-01", "2003-01-01", dtype="datetime64[h]")),
        daily=lambda x: x.hourly.dt.floor("D"),
        monthly=lambda x: pd.Series(x["hourly"].to_numpy().astype("datetime64[M]")),
        yearly=lambda x: pd.Series(x["hourly"].to_numpy().astype("datetime64[Y]")),
    )
)
df.sample(10)
| | hourly | daily | monthly | yearly |
|---|---|---|---|---|
| 5280 | 2000-08-08 00:00:00 | 2000-08-08 | 2000-08-01 | 2000-01-01 |
| 6378 | 2000-09-22 18:00:00 | 2000-09-22 | 2000-09-01 | 2000-01-01 |
| 18764 | 2002-02-20 20:00:00 | 2002-02-20 | 2002-02-01 | 2002-01-01 |
| 25739 | 2002-12-08 11:00:00 | 2002-12-08 | 2002-12-01 | 2002-01-01 |
| 8285 | 2000-12-11 05:00:00 | 2000-12-11 | 2000-12-01 | 2000-01-01 |
| 16114 | 2001-11-02 10:00:00 | 2001-11-02 | 2001-11-01 | 2001-01-01 |
| 15167 | 2001-09-23 23:00:00 | 2001-09-23 | 2001-09-01 | 2001-01-01 |
| 3424 | 2000-05-22 16:00:00 | 2000-05-22 | 2000-05-01 | 2000-01-01 |
| 19729 | 2002-04-02 01:00:00 | 2002-04-02 | 2002-04-01 | 2002-01-01 |
| 3148 | 2000-05-11 04:00:00 | 2000-05-11 | 2000-05-01 | 2000-01-01 |
> but it doesn't allow snapping to the start of years or months
Good catch. I think we'd need something like pd.offsets.YearBegin().rollback(obj), but for arrays instead of scalars (xref #7449).
For 1.5.0 your best bet is, like you've found, doing the astype directly on the underlying numpy arrays.
For 1.5.1 we can restore that behavior.
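A minimal sketch of that workaround, wrapped in a helper so the index and name survive the round trip through NumPy. The name `truncate_to_unit` is hypothetical (it is not a pandas API), and the sketch assumes a tz-naive datetime Series.

```python
import pandas as pd


def truncate_to_unit(ser: pd.Series, unit: str) -> pd.Series:
    """Hypothetical helper: truncate a tz-naive datetime Series via NumPy."""
    # For tz-naive data, to_numpy() returns a datetime64 ndarray.
    values = ser.to_numpy()
    # Cast to the coarse unit (e.g. "Y" or "M"), then back to nanoseconds.
    truncated = values.astype(f"datetime64[{unit}]").astype("datetime64[ns]")
    # Rebuild the Series, keeping the original index and name.
    return pd.Series(truncated, index=ser.index, name=ser.name)


ser = pd.Series(pd.to_datetime(["2000-02-02 05:00", "2001-11-30 23:59"]), name="ts")
yearly = truncate_to_unit(ser, "Y")
monthly = truncate_to_unit(ser, "M")
print(yearly)
print(monthly)
```

Casting back to `datetime64[ns]` before constructing the Series avoids depending on how a given pandas version handles `datetime64[Y]` or `datetime64[M]` input.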
moving off 2.0 as ser.astype("datetime64[Y]") raises now anyway
Is there a plan to restore ser.astype("datetime64[Y]") in future versions? Or will it always raise?
hey @lingyielia - this will continue to raise
are you trying to floor to the beginning of the year? If so, you could do
ser + pd.tseries.offsets.YearBegin() - pd.tseries.offsets.YearBegin()
In the future, it should be possible to do pd.tseries.offsets.YearBegin().rollback(ser)