pandas icon indicating copy to clipboard operation
pandas copied to clipboard

BUG: fixed Series.dt methods in ArrowDtype class that were returning incorrect values #57355

Open St0rmie opened this issue 11 months ago • 1 comments

  • [x] closes #57355
  • [x] Tests added and passed if fixing a bug or adding a new feature
  • [x] All code checks passed.
  • [x] Added type annotations to new arguments/methods/functions. (Not applicable)
  • [x] Added an entry in the latest doc/source/whatsnew/v3.0.0.rst file if fixing a bug or adding a new feature.

Fixed a bug where using the Series.dt methods for ArrowDtype would return incorrect time values and added a test to prevent that.

Reproducible example:

start_dates = (
    pd.Series([
        '2016-03-08',
    ], dtype='datetime64[ns]'
).astype('timestamp[ns][pyarrow]'))

end_dates = (
    pd.Series([
        '2017-03-08 23:59:59.999', 
     ], dtype='datetime64[ns]'
).astype('timestamp[ns][pyarrow]'))

duration = (end_dates - start_dates)

print(duration)
print(duration.dt.components)

Running the above example on main branch would give the following output:

0    365 days 23:59:59.999000
dtype: duration[ns][pyarrow]
   days  hours  minutes  seconds  milliseconds  microseconds  nanoseconds
0   365     23       59    86399           999        999000            0

By running the example with this fix, i get the following output:

0    365 days 23:59:59.999000
dtype: duration[ns][pyarrow]
   days  hours  minutes  seconds  milliseconds  microseconds  nanoseconds
0   365     23       59       59           999             0            0

St0rmie avatar Mar 28 '24 19:03 St0rmie

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

github-actions[bot] avatar Apr 28 '24 00:04 github-actions[bot]

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

github-actions[bot] avatar Jun 09 '24 00:06 github-actions[bot]

I apologize for the time i took to adress this issue. I tried to look for better ways to fix the issue and this seemed to be a reasonable option. In timedeltas we have two types of getters for seconds: timedelta.seconds which would return the hours, minutes and seconds in seconds and timedelta.components.seconds which would return the seconds specifically. While the second one returns the atributes of the class directly, the first one applies an operation to convert hours and minutes to seconds before summing the whole count of seconds.

In pyarrow, the components seem to be generated via the methods that i've changed (_dt_hours, _dt_minutes, _dt_seconds, etc. So, in order to change the return of components i had to change these methods. However, based on @WillAyd suggestion, i managed to find a way to do it without the loops by acessing the components in the timedelta generated which fixes the issue without compromising performance.

On the other hand, taking into consideration @jbrockmendel 's suggestion, i think that since the components are being generated via the methods i've changed the option here would be either changing how components are generated or creating another method that would return the hours, minutes and seconds in seconds just like in timedelta. In my opinion, having timedelta.seconds returning the hours, minutes and seconds in seconds but having timedelta.microseconds returning just the microseconds it's a bit confusing so i think it would be better to rename that method differently (such as time_as_seconds for example) and to create the corresponding method for the pyarrow structure.

I might be wrong or have missed something, so i would be very grateful if you could guide me through it if that's the case. Thank you for your time and patience.

St0rmie avatar Jun 15 '24 15:06 St0rmie

Thanks @St0rmie and sorry for the delay here

mroeschke avatar Jun 18 '24 18:06 mroeschke