pandas
pandas copied to clipboard
BUG: inconsistent behavior of DateOffset
Pandas version checks
-
[X] I have checked that this issue has not already been reported.
-
[X] I have confirmed this bug exists on the latest version of pandas.
-
[X] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd
df = pd.DataFrame({
'S':[pd.Timestamp('2019-04-30')],
'A':[pd.DateOffset(months=1)]
})
# FIRST TEST
>>> df['S'] + 26 * df['A']
0 2021-06-30
dtype: datetime64[ns]
# SECOND TEST
>>> df['S'].iloc[0] + 26 * df['A'].iloc[0]
Timestamp('2021-06-28 00:00:00')
Issue Description
It seems like multiplying a DateOffset by constant leads to an inconsistent behavior: the first test gives 30/6 while the second 28/6
Expected Behavior
In any case it should be 2021-06-30
Installed Versions
INSTALLED VERSIONS
commit : 66e3805b8cabe977f40c05259cc3fcf7ead5687d python : 3.10.1.final.0 python-bits : 64 OS : Windows OS-release : 10 Version : 10.0.19042 machine : AMD64 processor : Intel64 Family 6 Model 126 Stepping 5, GenuineIntel byteorder : little LC_ALL : None LANG : None LOCALE : Italian_Italy.1252
pandas : 1.3.5 numpy : 1.21.5 pytz : 2021.3 dateutil : 2.8.2 pip : 21.3.1 setuptools : 58.1.0 Cython : None pytest : None hypothesis : None sphinx : None blosc : None feather : None xlsxwriter : None lxml.etree : 4.8.0 html5lib : None pymysql : None psycopg2 : None jinja2 : None IPython : 8.1.1 pandas_datareader: None bs4 : None bottleneck : None fsspec : None fastparquet : None gcsfs : None matplotlib : 3.5.1 numexpr : None odfpy : None openpyxl : 3.0.9 pandas_gbq : None pyarrow : None pyxlsb : None s3fs : None scipy : 1.7.3 sqlalchemy : None tables : None tabulate : None xarray : None xlrd : None xlwt : None numba : None
While this is a bug, I would say that the second test actually gives the correct results (because you're adding one month at a time, when you add 1 month to 30-Jan-2020, you get 29-Feb-2020, and next year it'd be 28-Feb-2021). But it is a bug because the results mismatch when you add offsets to a series v/s a Timestamp. This would require a bug fix and maybe some test cases.
I had another look into this, looks like the problem is inside cython code. Applying offset todf['S']
does this:
months = (kwds.get("years", 0) * 12 + kwds.get("months", 0)) * self.n
inside offsets.pyx, so in essence it calculates months=1*26=26months and then applies the offset as months=26
rather than applying months=1
offset 26 times (which is what happens when offset is applied to a Timestamp rather than DatetimeArray).
Summary: When applying multiple offsets to DatetimeArray, Cython tries to accumulate months/weeks/days first and then applies one single offset. But when offset is applied to a Timestamp object, the offsets are applied iteratively, which is more accurate when providing multiple offsets.
Solution: changes to offsets.pyx along with test cases.
Would like to know your views @jreback, @jbrockmendel
While this is a bug, I would say that the second test actually gives the correct results (because you're adding one month at a time, when you add 1 month to 30-Jan-2020, you get 29-Feb-2020, and next year it'd be 28-Feb-2021). But it is a bug because the results mismatch when you add offsets to a series v/s a Timestamp. This would require a bug fix and maybe some test cases.
hi, I would say that the correct result should be 2021-06-30, since it is coherent with datatime relativedelta. Moreover, please consider the use case when you have a column with number of months and a column with a monthly dateoffset and let's imagine I have to obtain the 2021-06-30 effect; in such case I have to multiply the two columns and I have no other alternatives...(except creating different dateoffset rowwise, that is not so clean..)
hi I would add that the dataframe gives 2021-06-30 if it has 1 row and 2021-06-28 in case of 2 identical rows...
there are several bugs here, however imho the right result should be 2021-06-30 (coherent with relativedelta). "2 * 1 month" intuitively should be equal to "2 months", no?
It all comes down to how one interprets multiplying a offset/relativedelta by a scalar. I'll try to show by examples:
import pandas as pd
from dateutil.relativedelta import relativedelta
base_date = pd.Timestamp('2020-01-30')
offset_1month = pd.DateOffset(months=1)
offset_2month = pd.DateOffset(months=2)
delta_1month = relativedelta(months=1)
delta_2month = relativedelta(months=2)
base_date + 2 * offset_1month
>>> Timestamp('2020-03-29 00:00:00')
base_date + offset_1month + offset_1month
>>> Timestamp('2020-03-29 00:00:00')
base_date + offset_2month
>>> Timestamp('2020-03-30 00:00:00')
base_date + 2 * delta_1month
>>> Timestamp('2020-03-30 00:00:00')
base_date + delta_1month + delta_1month
>>> Timestamp('2020-03-29 00:00:00')
base_date + delta_2month
>>>Timestamp('2020-03-30 00:00:00')
As you can see, multiplying pd.DateOffset
by scalar n
basically means applying the pd.Offset
n
times. However, relativedelta operates in a different way, where if you multiply relativedelta
by scalar n
, they first re-calculate the relativedelta
and then apply to the base_date
, hence you see different values for 4th and 5th statements.
I think we need to first decide what we want to do when a pd.Offset
is multiplied by a scalar n
(apply pd.Offset
n
times or recalculate pd.Offset
and then move the base_date
).
The difference arises due to the fact that in the offset.pyx
file, where the addition is done, the scalar case uses a loop with a variable n
in the RelativeDeltaOffset
function _apply
to calculate with multiplication, and an array simply multiplies the whole offset value by n
in the RelativeDeltaOffset
function _apply_array
before adding.
take
Other similar instances of this operation in other related classes to DateOffset
use the method where they calculate the combined offset of all n
multiplications and move the date as such, so I think this instance should function the same way. I also think it's more in line with what a user performing this operation would want.
Other similar instances of this operation in other related classes to
DateOffset
use the method where they calculate the combined offset of alln
multiplications and move the date as such, so I think this instance should function the same way. I also think it's more in line with what a user performing this operation would want.
great.
please consider also this strange behavior: 1 row df differs from 2 rows df, even if these rows are equal
>>> df = pd.DataFrame({
'S':[pd.Timestamp('2019-04-30')],
'A':[pd.DateOffset(months=1)]
})
>>> df['S'] + 26 * df['A']
0 2021-06-30
dtype: datetime64[ns]
>>> df2 = df.append(df)
>>> df2['S'] + 26 * df2['A']
0 2021-06-28
0 2021-06-28
dtype: datetime64[ns]
please consider also this strange behavior: 1 row df differs from 2 rows df, even if these rows are equal
I expect this is bc of a fastpath in _addsub_object_array. id be OK with removing that fastpath.
for RelativeDeltaOffset._apply let's just multiply _offset by n instead of the loop. Will that be consistent with apply_array? If we can share anyhting with apply_array that'd be ideal
this PR looked pretty close, if anyone wants to take it forward https://github.com/pandas-dev/pandas/pull/50542
take
@MarcoGorelli this is my first open source contribution. Is it okay to follow the same steps as in #50542 and resolve the comments in my branch?
yeah probably