pandas icon indicating copy to clipboard operation
pandas copied to clipboard

Convert result of group by agg to pyarrow if input is pyarrow

Open undermyumbrella1 opened this issue 10 months ago • 12 comments

  • [ ] closes #53030 (Replace xxxx with the GitHub issue number)
  • [ ] Tests added and passed if fixing a bug or adding a new feature
  • [ ] All code checks passed.
  • [ ] Added type annotations to new arguments/methods/functions.
  • [ ] Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

Root cause:

  • agg_series always forces output dtype to be the same as input dtype, but depending on the lambda, the output dtype can be different

Fix:

  • replace all NA with nan
  • convert the `results' to respective pyarrow extension array, using pyarrow library methods
  • pyarrow library methods is used instead of maybe_convert_object, as maybe_convert_object does not check for NA, and forces dtype to float if NA is present (NA is not float in pyarrow),

undermyumbrella1 avatar Apr 03 '24 10:04 undermyumbrella1

I have added the issue in the pr, the pr is still work in progress

undermyumbrella1 avatar Apr 05 '24 06:04 undermyumbrella1

Hi, I have completed the implementation, may i check why linux test if failing with NameError: name 'pa' is not defined, but it works for other os?

undermyumbrella1 avatar Apr 08 '24 17:04 undermyumbrella1

df = DataFrame({"A": ["c1", "c2", "c3"], "B": [100, 200, 255]})
df["B"] = df["B"].astype("bool[pyarrow]")
gb = df.groupby("A")

result = gb.agg(lambda x: [1, 2, 3])
print(result["B"].dtype)
# list<item: int64>[pyarrow]

result = gb.agg(lambda x: [1, 2, "a"])
print(result["B"].dtype)
# object

Thank you for the review. For this example, it is expected as that is how pyarrow represents these data structures. E.g homogenous int list and heterogenous object. Alternatively, what would be the expected dtype in this case?

undermyumbrella1 avatar Apr 12 '24 05:04 undermyumbrella1

Sorry for the delay in resolving the test cases. Originally the approach is to always caste it back to input dtype, even if the output dtype is different. Currenlty the approach is to let pyarow infer the dtype of the output. However this can result in some unexpected cases as mentioned above. May I check what would be suggested output dtype to fix this issue?

For windows run, it seems to be failing as it is unable to import pyarrow in files unrelated to the code change in the pr, not sure if i have to change some configs.

undermyumbrella1 avatar Apr 21 '24 09:04 undermyumbrella1

May I check what would be suggested output dtype to fix this issue?

I think this is the same problem as the discussion here: https://github.com/pandas-dev/pandas/pull/58258#discussion_r1566446371. This appears to me something that will need careful implementation on the EA side.

rhshadrach avatar Apr 22 '24 20:04 rhshadrach

Noted, I have made the change according to the comments using convert_dtypes and a helper method to account for some unaccounted pyarrow dtypes. Although I am not sure if the change should be made here or in lib.maybe_convert_objects.

Some of the windows tests are failing due to unable to import pyarrow in other section of the code unrelated to this change. Not sure if this is an issue on my side.

undermyumbrella1 avatar Apr 25 '24 16:04 undermyumbrella1

Thank you for the review, I have updated the pr according to comments.

undermyumbrella1 avatar Apr 29 '24 09:04 undermyumbrella1

Thank you foe the review, I have made changes according to the pr comments.

undermyumbrella1 avatar May 02 '24 07:05 undermyumbrella1

I think merging main once more should resolve the CI issues.

rhshadrach avatar May 15 '24 20:05 rhshadrach

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

github-actions[bot] avatar Jun 15 '24 00:06 github-actions[bot]

I plan to push this across the finish line.

rhshadrach avatar Jun 19 '24 20:06 rhshadrach

Ah sorry, I’ll get to it this weekend

On Thu, 20 Jun 2024 at 4:49 AM, Richard Shadrach @.***> wrote:

I plan to push this across the finish line.

— Reply to this email directly, view it on GitHub https://github.com/pandas-dev/pandas/pull/58129#issuecomment-2179439892, or unsubscribe https://github.com/notifications/unsubscribe-auth/A4UEHW5KQ6INWZS7EBRXLFLZIHVFVAVCNFSM6AAAAABFVB4M4WVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCNZZGQZTSOBZGI . You are receiving this because you were mentioned.Message ID: @.***>

undermyumbrella1 avatar Jun 20 '24 02:06 undermyumbrella1

Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen.

mroeschke avatar Aug 09 '24 17:08 mroeschke