pandas Convert result of group by agg to pyarrow if input is pyarrow

[ ] closes #53030 (Replace xxxx with the GitHub issue number)
[ ] Tests added and passed if fixing a bug or adding a new feature
[ ] All code checks passed.
[ ] Added type annotations to new arguments/methods/functions.
[ ] Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

Root cause:

agg_series always forces output dtype to be the same as input dtype, but depending on the lambda, the output dtype can be different

Fix:

replace all NA with nan
convert the `results' to respective pyarrow extension array, using pyarrow library methods
pyarrow library methods is used instead of maybe_convert_object, as maybe_convert_object does not check for NA, and forces dtype to float if NA is present (NA is not float in pyarrow),

Apr 03 '24 10:04 undermyumbrella1

I have added the issue in the pr, the pr is still work in progress

Apr 05 '24 06:04 undermyumbrella1

Hi, I have completed the implementation, may i check why linux test if failing with NameError: name 'pa' is not defined, but it works for other os?

Apr 08 '24 17:04 undermyumbrella1

df = DataFrame({"A": ["c1", "c2", "c3"], "B": [100, 200, 255]})
df["B"] = df["B"].astype("bool[pyarrow]")
gb = df.groupby("A")

result = gb.agg(lambda x: [1, 2, 3])
print(result["B"].dtype)
# list<item: int64>[pyarrow]

result = gb.agg(lambda x: [1, 2, "a"])
print(result["B"].dtype)
# object

Thank you for the review. For this example, it is expected as that is how pyarrow represents these data structures. E.g homogenous int list and heterogenous object. Alternatively, what would be the expected dtype in this case?

Apr 12 '24 05:04 undermyumbrella1

Sorry for the delay in resolving the test cases. Originally the approach is to always caste it back to input dtype, even if the output dtype is different. Currenlty the approach is to let pyarow infer the dtype of the output. However this can result in some unexpected cases as mentioned above. May I check what would be suggested output dtype to fix this issue?

For windows run, it seems to be failing as it is unable to import pyarrow in files unrelated to the code change in the pr, not sure if i have to change some configs.

Apr 21 '24 09:04 undermyumbrella1

May I check what would be suggested output dtype to fix this issue?

I think this is the same problem as the discussion here: https://github.com/pandas-dev/pandas/pull/58258#discussion_r1566446371. This appears to me something that will need careful implementation on the EA side.

Apr 22 '24 20:04 rhshadrach

Noted, I have made the change according to the comments using convert_dtypes and a helper method to account for some unaccounted pyarrow dtypes. Although I am not sure if the change should be made here or in lib.maybe_convert_objects.

Some of the windows tests are failing due to unable to import pyarrow in other section of the code unrelated to this change. Not sure if this is an issue on my side.

Apr 25 '24 16:04 undermyumbrella1

Thank you for the review, I have updated the pr according to comments.

Apr 29 '24 09:04 undermyumbrella1

Thank you foe the review, I have made changes according to the pr comments.

May 02 '24 07:05 undermyumbrella1

I think merging main once more should resolve the CI issues.

May 15 '24 20:05 rhshadrach

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

Jun 15 '24 00:06 github-actions[bot]

I plan to push this across the finish line.

Jun 19 '24 20:06 rhshadrach

Ah sorry, I’ll get to it this weekend

On Thu, 20 Jun 2024 at 4:49 AM, Richard Shadrach @.***> wrote:

I plan to push this across the finish line.

— Reply to this email directly, view it on GitHub https://github.com/pandas-dev/pandas/pull/58129#issuecomment-2179439892, or unsubscribe https://github.com/notifications/unsubscribe-auth/A4UEHW5KQ6INWZS7EBRXLFLZIHVFVAVCNFSM6AAAAABFVB4M4WVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCNZZGQZTSOBZGI . You are receiving this because you were mentioned.Message ID: @.***>

Jun 20 '24 02:06 undermyumbrella1

Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen.

Aug 09 '24 17:08 mroeschke

pandas pandas copied to clipboard

Convert result of group by agg to pyarrow if input is pyarrow

pandas
pandas copied to clipboard