pandas
pandas copied to clipboard
BUG: fixed Series.dt methods in ArrowDtype class that were returning incorrect values #57355
- [x] closes #57355
- [x] Tests added and passed if fixing a bug or adding a new feature
- [x] All code checks passed.
- [x] Added type annotations to new arguments/methods/functions. (Not applicable)
- [x] Added an entry in the latest
doc/source/whatsnew/v3.0.0.rst
file if fixing a bug or adding a new feature.
Fixed a bug where using the Series.dt
methods for ArrowDtype
would return incorrect time values and added a test to prevent that.
Reproducible example:
start_dates = (
pd.Series([
'2016-03-08',
], dtype='datetime64[ns]'
).astype('timestamp[ns][pyarrow]'))
end_dates = (
pd.Series([
'2017-03-08 23:59:59.999',
], dtype='datetime64[ns]'
).astype('timestamp[ns][pyarrow]'))
duration = (end_dates - start_dates)
print(duration)
print(duration.dt.components)
Running the above example on main branch would give the following output:
0 365 days 23:59:59.999000
dtype: duration[ns][pyarrow]
days hours minutes seconds milliseconds microseconds nanoseconds
0 365 23 59 86399 999 999000 0
By running the example with this fix, i get the following output:
0 365 days 23:59:59.999000
dtype: duration[ns][pyarrow]
days hours minutes seconds milliseconds microseconds nanoseconds
0 365 23 59 59 999 0 0
This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.
This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.
I apologize for the time i took to adress this issue.
I tried to look for better ways to fix the issue and this seemed to be a reasonable option.
In timedeltas we have two types of getters for seconds: timedelta.seconds
which would return the hours, minutes and seconds in seconds and timedelta.components.seconds
which would return the seconds specifically. While the second one returns the atributes of the class directly, the first one applies an operation to convert hours and minutes to seconds before summing the whole count of seconds.
In pyarrow, the components
seem to be generated via the methods that i've changed (_dt_hours
, _dt_minutes
, _dt_seconds
, etc. So, in order to change the return of components
i had to change these methods. However, based on @WillAyd suggestion, i managed to find a way to do it without the loops by acessing the components in the timedelta generated which fixes the issue without compromising performance.
On the other hand, taking into consideration @jbrockmendel 's suggestion, i think that since the components are being generated via the methods i've changed the option here would be either changing how components are generated or creating another method that would return the hours, minutes and seconds in seconds just like in timedelta. In my opinion, having timedelta.seconds
returning the hours, minutes and seconds in seconds but having timedelta.microseconds
returning just the microseconds it's a bit confusing so i think it would be better to rename that method differently (such as time_as_seconds
for example) and to create the corresponding method for the pyarrow structure.
I might be wrong or have missed something, so i would be very grateful if you could guide me through it if that's the case. Thank you for your time and patience.
Thanks @St0rmie and sorry for the delay here