BUG/PERF: use lexsort_indexer in MultiIndex.argsort
- [x] Tests added and passed if fixing a bug or adding a new feature
- [x] All code checks passed.
- [x] Added type annotations to new arguments/methods/functions.
- [x] Added an entry in the latest
doc/source/whatsnew/v1.6.0.rstfile if fixing a bug or adding a new feature.
cc: @phofl, @mroeschke - follow on to #48406
Had I been aware of lexsort_indexer I would have used that in #48406. Using lexsort_indexer handles pd.NA (np.lexsort raises) and has better perf for nullable types (Int64 on par with int64).
While the perf for non-nullable types (e.g. int64) suffers a bit, this is still much faster than prior to #48406.
before after ratio
[4e72340d] [ead3c00d]
<main> <multiindex-argsort>
+ 1.78±0.03ms 3.59±0.4ms 2.01 multiindex_object.SortValues.time_sort_values('int64')
- 26.7±0.5ms 3.51±0.5ms 0.13 multiindex_object.SortValues.time_sort_values('Int64')
Special casing only numpy dtypes makes it ugly, correct?
That was my thinking. DataFrame.sort_index uses the same approach with lexsort_indexer and does not special case numpy dtypes.
Ok, sounds reasonable
Can you merge main to get ci green?
Would be good to have a test that calls
MultiIndex.argsortdirectly
added an explicit argsort test
Just merged the other pr. Could you maybe run asvs once more? Union has a bunch of code paths, want to see if this is affected
Just merged the other pr. Could you maybe run asvs once more? Union has a bunch of code paths, want to see if this is affected
I don’t think this touches union. Were you thinking of #48504 ?
@phofl, @mroeschke
This is now failing following the merge of #48505. The test failing is test_union_sort_other_incomparable which expects to see a RunTimeWarning from MultiIndex._union. However, this uncovers some inconsistency in the main branch.
Here is an example:
import pandas as pd
idx = pd.MultiIndex.from_product([[pd.Timestamp("2000"), 1], ["a", "b"]])
df = pd.DataFrame({"A": 1}, index=idx)
# on main, this returns without warning/error as it goes through lexsort_indexer
df.sort_index()
# on main, this raises a TypeError as it goes through np.lexsort
df.index.sort_values()
With this PR the behavior will be consistent as we'll be sending both through lexsort_indexer. I've removed the expectation for a RunTimeWarning from the test. Let me know if that doesn't make sense.
@phofl - gentle ping. I think this is ready.
thx. @lukemanley