BUG: Fix `is_unique` regression for slices of `Index`es
- [x] closes #57911
- [x] Tests added and passed if fixing a bug or adding a new feature
- [x] All code checks passed.
- [ ] Added type annotations to new arguments/methods/functions.
- [x] Added an entry in the latest
doc/source/whatsnew/v2.2.2.rstfile if fixing a bug or adding a new feature.
Slicing a unique Index will always give another unique Index, so inheriting uniqueness flags is safe and efficient. However, the slice of a non-unique Index can end up with unique elements. Inheriting in the non-unique case caused the regression so this PR changes the code to just inherit when the base Index is marked as unique.
A note or question for reviewers: With this extra code, is pre-computing uniqueness and monotonicity for indexes still a performance boost?
I'm still seeing gains for calling is_monotonic_increasing on a sliced Index, almost as large as in #51738. However, pre-computing slows down slicing when neither is_unique nor is_monotonic_increasing is called on the resulting Index. With no benefit, pre-computing is a performance decrease. I'd expect that slicing data frames and series is more common than calling is_monotonic_increasing on a sliced index, but is it significantly more common? For a full offset, I think it would have to be more than 500x as frequent.
This PR:
In [1]: import pandas as pd
In [2]: df = pd.DataFrame(index=list(range(1_000_000)))
In [3]: df.index.is_monotonic_increasing
Out[3]: True
In [4]: %timeit df.index[:].is_monotonic_increasing
3.79 µs ± 19.1 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
In [5]: %timeit df[:]
8.23 µs ± 63.6 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
Without pre-computing:
In [5]: %timeit df.index[:].is_monotonic_increasing
1.28 ms ± 2.6 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [6]: %timeit df[:]
4.75 µs ± 2.95 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
But it only slows it down by a large margin if is_unique or is_monotonic_increasing were called on the original index, right?
Thoughts @mroeschke?
The pre-computing code should run anytime Index has generated its _engine, which can come from a couple other methods in addition to is_unique and is_monotonic_increasing. I think DataFrame.loc indirectly calls one of these indexer methods.
EDIT: as a quick check, I raised in _update_from_sliced (where the pre-computing happens) and ran part of the test suite. Looks like it can run in other places too, like DataFrame.join.
I think DataFrame.join uses is_monotonic_increasing and is_unique, although I'm not sure if they're in a place to benefit from the pre-computation. On net, the code's impact on efficiency looks pretty complicated and might take awhile to fully count.
To avoid scope creep in this issue, how about I open a separate issue for a performance check and have this PR move forward with the bug fix?
To avoid scope creep in this issue, how about I open a separate issue for a performance check and have this PR move forward with the bug fix?
Sure, sounds good to me.
This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.
Sorry this is not stale, @WillAyd mind having a look? Initially looks good to me. (I just pinged because you got assigned to this, let me know if I should ping someone else)
This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.
Updated for 3.0.0 now.
Is there anything else I should do to make this PR not stale any more?
Is there anything else I should do to make this PR not stale any more?
I removed the label. @mroeschke mind taking a look please?
Thanks @rob-sil