pandas
pandas copied to clipboard
Use mask to create result_mask that filters nan categories
- [x] closes #55326
- [x] Tests added and passed if fixing a bug or adding a new feature
- [x] All code checks passed.
- [ ] Added type annotations to new arguments/methods/functions.
- [x] Added an entry in the latest
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.
Is there an issue linked with this?
Is there an issue linked with this?
No clue. @undermyumbrella1 I'd appreciate an explanation of what this change accomplishes. And please make sure all the code tests pass
this is a work in progress for issue #55326 , i have added the issue number
ok, the pr implementation is completed
HI thank you for the pr review, I have changed my implementation to temporarily set observed to true (and respective groupers), so that transform will return the correct result.
I have initially tried to change the result of getattr(self, func)(*args, **kwargs)
, by using grouped reduce to map each result block to out_dtype
that was determined in _cython_operation
. However this impl turned out to be way too complicated, as the out_dtype
, out_shape
, views of the original value block is determined by the entire nested sequence of method calls. Extracting this logic out proved to be complicated.
Thank you for the review, I have made the changes as requested
Thank you for the review, i have made the changes as suggested
Thanks for the changes @undermyumbrella1 - this is looking good! I have some minor refactor/style requests, but I'd like to get another eye here before any more work is done.
@mroeschke - would you be able to take a look? In addition to the issue linked in the OP, this is fixing a regression caused by #55738:
N = 10**3
data = {
"a1": Categorical(np.random.randint(100, size=N), categories=np.arange(N)),
"a2": Categorical(np.random.randint(100, size=N), categories=np.arange(N)),
"b": np.random.random(N),
}
df = DataFrame(data)
%timeit df.groupby(["a1", "a2"], observed=False)["b"].transform("sum")
# 6.83 ms ± 27.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) <-- main
# 687 µs ± 16.3 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each) <-- PR
While it's undesirable to swap out the grouper as is done here, I do not see any better way. There may be more efficient ways of computed the observed codes / result_index, but that can be readily built upon this later on.
Thank you for the review, I have updated the pr according to comments.
Thank you for the review, I have updated the pr according to comments.
Thank you for the review, I have updated the pr according to comments. Noted on force pushing
Thanks @undermyumbrella1 - very nice!