pandas icon indicating copy to clipboard operation
pandas copied to clipboard

BUG: Fix `is_unique` regression for slices of `Index`es

Open rob-sil opened this issue 1 year ago • 7 comments

  • [x] closes #57911
  • [x] Tests added and passed if fixing a bug or adding a new feature
  • [x] All code checks passed.
  • [ ] Added type annotations to new arguments/methods/functions.
  • [x] Added an entry in the latest doc/source/whatsnew/v2.2.2.rst file if fixing a bug or adding a new feature.

Slicing a unique Index will always give another unique Index, so inheriting uniqueness flags is safe and efficient. However, the slice of a non-unique Index can end up with unique elements. Inheriting in the non-unique case caused the regression so this PR changes the code to just inherit when the base Index is marked as unique.

rob-sil avatar Mar 22 '24 04:03 rob-sil

A note or question for reviewers: With this extra code, is pre-computing uniqueness and monotonicity for indexes still a performance boost?

I'm still seeing gains for calling is_monotonic_increasing on a sliced Index, almost as large as in #51738. However, pre-computing slows down slicing when neither is_unique nor is_monotonic_increasing is called on the resulting Index. With no benefit, pre-computing is a performance decrease. I'd expect that slicing data frames and series is more common than calling is_monotonic_increasing on a sliced index, but is it significantly more common? For a full offset, I think it would have to be more than 500x as frequent.

This PR:

In [1]: import pandas as pd

In [2]: df = pd.DataFrame(index=list(range(1_000_000)))

In [3]: df.index.is_monotonic_increasing
Out[3]: True

In [4]: %timeit df.index[:].is_monotonic_increasing
3.79 µs ± 19.1 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [5]: %timeit df[:]
8.23 µs ± 63.6 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

Without pre-computing:

In [5]: %timeit df.index[:].is_monotonic_increasing
1.28 ms ± 2.6 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [6]: %timeit df[:]
4.75 µs ± 2.95 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

rob-sil avatar Mar 26 '24 04:03 rob-sil

But it only slows it down by a large margin if is_unique or is_monotonic_increasing were called on the original index, right? Thoughts @mroeschke?

Aloqeely avatar Mar 26 '24 10:03 Aloqeely

The pre-computing code should run anytime Index has generated its _engine, which can come from a couple other methods in addition to is_unique and is_monotonic_increasing. I think DataFrame.loc indirectly calls one of these indexer methods.

EDIT: as a quick check, I raised in _update_from_sliced (where the pre-computing happens) and ran part of the test suite. Looks like it can run in other places too, like DataFrame.join.

rob-sil avatar Mar 26 '24 12:03 rob-sil

I think DataFrame.join uses is_monotonic_increasing and is_unique, although I'm not sure if they're in a place to benefit from the pre-computation. On net, the code's impact on efficiency looks pretty complicated and might take awhile to fully count.

To avoid scope creep in this issue, how about I open a separate issue for a performance check and have this PR move forward with the bug fix?

rob-sil avatar Mar 29 '24 04:03 rob-sil

To avoid scope creep in this issue, how about I open a separate issue for a performance check and have this PR move forward with the bug fix?

Sure, sounds good to me.

Aloqeely avatar Mar 30 '24 19:03 Aloqeely

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

github-actions[bot] avatar Apr 30 '24 00:04 github-actions[bot]

Sorry this is not stale, @WillAyd mind having a look? Initially looks good to me. (I just pinged because you got assigned to this, let me know if I should ping someone else)

Aloqeely avatar May 07 '24 03:05 Aloqeely

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

Updated for 3.0.0 now.

rob-sil avatar May 27 '24 21:05 rob-sil

Is there anything else I should do to make this PR not stale any more?

rob-sil avatar Jun 04 '24 01:06 rob-sil

Is there anything else I should do to make this PR not stale any more?

I removed the label. @mroeschke mind taking a look please?

Aloqeely avatar Jul 20 '24 20:07 Aloqeely

Thanks @rob-sil

mroeschke avatar Aug 06 '24 01:08 mroeschke