pandas icon indicating copy to clipboard operation
pandas copied to clipboard

BUG: inconsistent behavior of DateOffset

Open colomb8 opened this issue 2 years ago • 5 comments

Pandas version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of pandas.

  • [X] I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

df = pd.DataFrame({
   'S':[pd.Timestamp('2019-04-30')],
   'A':[pd.DateOffset(months=1)]
   })

# FIRST TEST
>>> df['S'] + 26 * df['A']
0   2021-06-30
dtype: datetime64[ns]

# SECOND TEST
>>> df['S'].iloc[0] + 26 * df['A'].iloc[0]
Timestamp('2021-06-28 00:00:00')

Issue Description

It seems like multiplying a DateOffset by constant leads to an inconsistent behavior: the first test gives 30/6 while the second 28/6

Expected Behavior

In any case it should be 2021-06-30

Installed Versions

INSTALLED VERSIONS

commit : 66e3805b8cabe977f40c05259cc3fcf7ead5687d python : 3.10.1.final.0 python-bits : 64 OS : Windows OS-release : 10 Version : 10.0.19042 machine : AMD64 processor : Intel64 Family 6 Model 126 Stepping 5, GenuineIntel byteorder : little LC_ALL : None LANG : None LOCALE : Italian_Italy.1252

pandas : 1.3.5 numpy : 1.21.5 pytz : 2021.3 dateutil : 2.8.2 pip : 21.3.1 setuptools : 58.1.0 Cython : None pytest : None hypothesis : None sphinx : None blosc : None feather : None xlsxwriter : None lxml.etree : 4.8.0 html5lib : None pymysql : None psycopg2 : None jinja2 : None IPython : 8.1.1 pandas_datareader: None bs4 : None bottleneck : None fsspec : None fastparquet : None gcsfs : None matplotlib : 3.5.1 numexpr : None odfpy : None openpyxl : 3.0.9 pandas_gbq : None pyarrow : None pyxlsb : None s3fs : None scipy : 1.7.3 sqlalchemy : None tables : None tabulate : None xarray : None xlrd : None xlwt : None numba : None

colomb8 avatar Aug 03 '22 23:08 colomb8

While this is a bug, I would say that the second test actually gives the correct results (because you're adding one month at a time, when you add 1 month to 30-Jan-2020, you get 29-Feb-2020, and next year it'd be 28-Feb-2021). But it is a bug because the results mismatch when you add offsets to a series v/s a Timestamp. This would require a bug fix and maybe some test cases.

pratyushsharan avatar Aug 09 '22 06:08 pratyushsharan

I had another look into this, looks like the problem is inside cython code. Applying offset todf['S'] does this: months = (kwds.get("years", 0) * 12 + kwds.get("months", 0)) * self.n inside offsets.pyx, so in essence it calculates months=1*26=26months and then applies the offset as months=26 rather than applying months=1 offset 26 times (which is what happens when offset is applied to a Timestamp rather than DatetimeArray).

Summary: When applying multiple offsets to DatetimeArray, Cython tries to accumulate months/weeks/days first and then applies one single offset. But when offset is applied to a Timestamp object, the offsets are applied iteratively, which is more accurate when providing multiple offsets.

Solution: changes to offsets.pyx along with test cases.

Would like to know your views @jreback, @jbrockmendel

pratyushsharan avatar Aug 09 '22 07:08 pratyushsharan

While this is a bug, I would say that the second test actually gives the correct results (because you're adding one month at a time, when you add 1 month to 30-Jan-2020, you get 29-Feb-2020, and next year it'd be 28-Feb-2021). But it is a bug because the results mismatch when you add offsets to a series v/s a Timestamp. This would require a bug fix and maybe some test cases.

hi, I would say that the correct result should be 2021-06-30, since it is coherent with datatime relativedelta. Moreover, please consider the use case when you have a column with number of months and a column with a monthly dateoffset and let's imagine I have to obtain the 2021-06-30 effect; in such case I have to multiply the two columns and I have no other alternatives...(except creating different dateoffset rowwise, that is not so clean..)

colomb8 avatar Aug 09 '22 08:08 colomb8

hi I would add that the dataframe gives 2021-06-30 if it has 1 row and 2021-06-28 in case of 2 identical rows...

there are several bugs here, however imho the right result should be 2021-06-30 (coherent with relativedelta). "2 * 1 month" intuitively should be equal to "2 months", no?

colomb8 avatar Aug 09 '22 08:08 colomb8

It all comes down to how one interprets multiplying a offset/relativedelta by a scalar. I'll try to show by examples:

import pandas as pd
from dateutil.relativedelta import relativedelta

base_date = pd.Timestamp('2020-01-30')

offset_1month = pd.DateOffset(months=1)
offset_2month = pd.DateOffset(months=2)
delta_1month = relativedelta(months=1)
delta_2month = relativedelta(months=2)

base_date + 2 * offset_1month
>>> Timestamp('2020-03-29 00:00:00')

base_date + offset_1month + offset_1month
>>> Timestamp('2020-03-29 00:00:00')

base_date + offset_2month
>>> Timestamp('2020-03-30 00:00:00')

base_date + 2 * delta_1month
>>> Timestamp('2020-03-30 00:00:00')

base_date + delta_1month + delta_1month
>>> Timestamp('2020-03-29 00:00:00')

base_date + delta_2month
>>>Timestamp('2020-03-30 00:00:00')

As you can see, multiplying pd.DateOffset by scalar n basically means applying the pd.Offset n times. However, relativedelta operates in a different way, where if you multiply relativedelta by scalar n, they first re-calculate the relativedelta and then apply to the base_date, hence you see different values for 4th and 5th statements.

I think we need to first decide what we want to do when a pd.Offset is multiplied by a scalar n (apply pd.Offset n times or recalculate pd.Offset and then move the base_date).

pratyushsharan avatar Aug 09 '22 09:08 pratyushsharan

The difference arises due to the fact that in the offset.pyx file, where the addition is done, the scalar case uses a loop with a variable n in the RelativeDeltaOffset function _apply to calculate with multiplication, and an array simply multiplies the whole offset value by n in the RelativeDeltaOffset function _apply_array before adding.

srotondo avatar Aug 23 '22 21:08 srotondo

take

srotondo avatar Aug 24 '22 17:08 srotondo

Other similar instances of this operation in other related classes to DateOffset use the method where they calculate the combined offset of all n multiplications and move the date as such, so I think this instance should function the same way. I also think it's more in line with what a user performing this operation would want.

srotondo avatar Aug 24 '22 23:08 srotondo

Other similar instances of this operation in other related classes to DateOffset use the method where they calculate the combined offset of all n multiplications and move the date as such, so I think this instance should function the same way. I also think it's more in line with what a user performing this operation would want.

great.

please consider also this strange behavior: 1 row df differs from 2 rows df, even if these rows are equal

>>> df = pd.DataFrame({
   'S':[pd.Timestamp('2019-04-30')],
   'A':[pd.DateOffset(months=1)]
   })

>>> df['S'] + 26 * df['A']

0   2021-06-30
dtype: datetime64[ns]

>>> df2 = df.append(df)

>>> df2['S'] + 26 * df2['A']

0   2021-06-28
0   2021-06-28
dtype: datetime64[ns]

colomb8 avatar Aug 25 '22 12:08 colomb8

please consider also this strange behavior: 1 row df differs from 2 rows df, even if these rows are equal

I expect this is bc of a fastpath in _addsub_object_array. id be OK with removing that fastpath.

jbrockmendel avatar Aug 25 '22 23:08 jbrockmendel

for RelativeDeltaOffset._apply let's just multiply _offset by n instead of the loop. Will that be consistent with apply_array? If we can share anyhting with apply_array that'd be ideal

jbrockmendel avatar Aug 25 '22 23:08 jbrockmendel

this PR looked pretty close, if anyone wants to take it forward https://github.com/pandas-dev/pandas/pull/50542

MarcoGorelli avatar Apr 27 '23 08:04 MarcoGorelli

take

rsm-23 avatar Jun 13 '23 09:06 rsm-23

@MarcoGorelli this is my first open source contribution. Is it okay to follow the same steps as in #50542 and resolve the comments in my branch?

rsm-23 avatar Jun 13 '23 09:06 rsm-23

yeah probably

MarcoGorelli avatar Jun 13 '23 10:06 MarcoGorelli