pandas
pandas copied to clipboard
Convert result of group by agg to pyarrow if input is pyarrow
- [ ] closes #53030 (Replace xxxx with the GitHub issue number)
- [ ] Tests added and passed if fixing a bug or adding a new feature
- [ ] All code checks passed.
- [ ] Added type annotations to new arguments/methods/functions.
- [ ] Added an entry in the latest
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.
Root cause:
-
agg_series
always forces output dtype to be the same as input dtype, but depending on the lambda, the output dtype can be different
Fix:
- replace all NA with nan
- convert the `results' to respective pyarrow extension array, using pyarrow library methods
- pyarrow library methods is used instead of
maybe_convert_object
, asmaybe_convert_object
does not check for NA, and forces dtype to float if NA is present (NA is not float in pyarrow),
I have added the issue in the pr, the pr is still work in progress
Hi, I have completed the implementation, may i check why linux test if failing with NameError: name 'pa' is not defined
, but it works for other os?
df = DataFrame({"A": ["c1", "c2", "c3"], "B": [100, 200, 255]})
df["B"] = df["B"].astype("bool[pyarrow]")
gb = df.groupby("A")
result = gb.agg(lambda x: [1, 2, 3])
print(result["B"].dtype)
# list<item: int64>[pyarrow]
result = gb.agg(lambda x: [1, 2, "a"])
print(result["B"].dtype)
# object
Thank you for the review. For this example, it is expected as that is how pyarrow represents these data structures. E.g homogenous int list and heterogenous object. Alternatively, what would be the expected dtype in this case?
Sorry for the delay in resolving the test cases. Originally the approach is to always caste it back to input dtype, even if the output dtype is different. Currenlty the approach is to let pyarow infer the dtype of the output. However this can result in some unexpected cases as mentioned above. May I check what would be suggested output dtype to fix this issue?
For windows run, it seems to be failing as it is unable to import pyarrow in files unrelated to the code change in the pr, not sure if i have to change some configs.
May I check what would be suggested output dtype to fix this issue?
I think this is the same problem as the discussion here: https://github.com/pandas-dev/pandas/pull/58258#discussion_r1566446371. This appears to me something that will need careful implementation on the EA side.
Noted, I have made the change according to the comments using convert_dtypes
and a helper method to account for some unaccounted pyarrow dtypes. Although I am not sure if the change should be made here or in lib.maybe_convert_objects
.
Some of the windows tests are failing due to unable to import pyarrow in other section of the code unrelated to this change. Not sure if this is an issue on my side.
Thank you for the review, I have updated the pr according to comments.
Thank you foe the review, I have made changes according to the pr comments.
I think merging main once more should resolve the CI issues.
This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.
I plan to push this across the finish line.
Ah sorry, I’ll get to it this weekend
On Thu, 20 Jun 2024 at 4:49 AM, Richard Shadrach @.***> wrote:
I plan to push this across the finish line.
— Reply to this email directly, view it on GitHub https://github.com/pandas-dev/pandas/pull/58129#issuecomment-2179439892, or unsubscribe https://github.com/notifications/unsubscribe-auth/A4UEHW5KQ6INWZS7EBRXLFLZIHVFVAVCNFSM6AAAAABFVB4M4WVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCNZZGQZTSOBZGI . You are receiving this because you were mentioned.Message ID: @.***>
Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen.