pandas Use mask to create result_mask that filters nan categories

[x] closes #55326
[x] Tests added and passed if fixing a bug or adding a new feature
[x] All code checks passed.
[ ] Added type annotations to new arguments/methods/functions.
[x] Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

Mar 30 '24 09:03 undermyumbrella1

Is there an issue linked with this?

Mar 30 '24 12:03 asishm

Is there an issue linked with this?

No clue. @undermyumbrella1 I'd appreciate an explanation of what this change accomplishes. And please make sure all the code tests pass

Mar 30 '24 20:03 Aloqeely

this is a work in progress for issue #55326 , i have added the issue number

Mar 31 '24 15:03 undermyumbrella1

ok, the pr implementation is completed

Apr 05 '24 06:04 undermyumbrella1

HI thank you for the pr review, I have changed my implementation to temporarily set observed to true (and respective groupers), so that transform will return the correct result.

I have initially tried to change the result of getattr(self, func)(*args, **kwargs), by using grouped reduce to map each result block to out_dtype that was determined in _cython_operation. However this impl turned out to be way too complicated, as the out_dtype, out_shape, views of the original value block is determined by the entire nested sequence of method calls. Extracting this logic out proved to be complicated.

Apr 17 '24 16:04 undermyumbrella1

Thank you for the review, I have made the changes as requested

Apr 20 '24 10:04 undermyumbrella1

Thank you for the review, i have made the changes as suggested

Apr 21 '24 08:04 undermyumbrella1

Thanks for the changes @undermyumbrella1 - this is looking good! I have some minor refactor/style requests, but I'd like to get another eye here before any more work is done.

@mroeschke - would you be able to take a look? In addition to the issue linked in the OP, this is fixing a regression caused by #55738:

N = 10**3
data = {
    "a1": Categorical(np.random.randint(100, size=N), categories=np.arange(N)),
    "a2": Categorical(np.random.randint(100, size=N), categories=np.arange(N)),
    "b": np.random.random(N),
}
df = DataFrame(data)
%timeit df.groupby(["a1", "a2"], observed=False)["b"].transform("sum")
# 6.83 ms ± 27.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  <-- main
# 687 µs ± 16.3 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)  <-- PR

While it's undesirable to swap out the grouper as is done here, I do not see any better way. There may be more efficient ways of computed the observed codes / result_index, but that can be readily built upon this later on.

Apr 25 '24 21:04 rhshadrach

Thank you for the review, I have updated the pr according to comments.

Apr 29 '24 08:04 undermyumbrella1

Thank you for the review, I have updated the pr according to comments.

May 02 '24 04:05 undermyumbrella1

Thank you for the review, I have updated the pr according to comments. Noted on force pushing

May 07 '24 04:05 undermyumbrella1

Thanks @undermyumbrella1 - very nice!

May 08 '24 22:05 rhshadrach

pandas pandas copied to clipboard

Use mask to create result_mask that filters nan categories

pandas
pandas copied to clipboard