pandas icon indicating copy to clipboard operation
pandas copied to clipboard

BUG/PERF: use lexsort_indexer in MultiIndex.argsort

Open lukemanley opened this issue 3 years ago • 7 comments

cc: @phofl, @mroeschke - follow on to #48406

Had I been aware of lexsort_indexer I would have used that in #48406. Using lexsort_indexer handles pd.NA (np.lexsort raises) and has better perf for nullable types (Int64 on par with int64).

While the perf for non-nullable types (e.g. int64) suffers a bit, this is still much faster than prior to #48406.

       before           after         ratio
     [4e72340d]       [ead3c00d]
     <main>           <multiindex-argsort>
+     1.78±0.03ms       3.59±0.4ms     2.01  multiindex_object.SortValues.time_sort_values('int64')
-      26.7±0.5ms       3.51±0.5ms     0.13  multiindex_object.SortValues.time_sort_values('Int64')

lukemanley avatar Sep 10 '22 10:09 lukemanley

Special casing only numpy dtypes makes it ugly, correct?

That was my thinking. DataFrame.sort_index uses the same approach with lexsort_indexer and does not special case numpy dtypes.

lukemanley avatar Sep 10 '22 10:09 lukemanley

Ok, sounds reasonable

phofl avatar Sep 10 '22 10:09 phofl

Can you merge main to get ci green?

phofl avatar Sep 10 '22 11:09 phofl

Would be good to have a test that calls MultiIndex.argsort directly

added an explicit argsort test

lukemanley avatar Sep 13 '22 00:09 lukemanley

Just merged the other pr. Could you maybe run asvs once more? Union has a bunch of code paths, want to see if this is affected

phofl avatar Sep 13 '22 17:09 phofl

Just merged the other pr. Could you maybe run asvs once more? Union has a bunch of code paths, want to see if this is affected

I don’t think this touches union. Were you thinking of #48504 ?

lukemanley avatar Sep 13 '22 22:09 lukemanley

@phofl, @mroeschke

This is now failing following the merge of #48505. The test failing is test_union_sort_other_incomparable which expects to see a RunTimeWarning from MultiIndex._union. However, this uncovers some inconsistency in the main branch.

Here is an example:

import pandas as pd
idx = pd.MultiIndex.from_product([[pd.Timestamp("2000"), 1], ["a", "b"]])
df = pd.DataFrame({"A": 1}, index=idx)

# on main, this returns without warning/error as it goes through lexsort_indexer
df.sort_index()

# on main, this raises a TypeError as it goes through np.lexsort
df.index.sort_values()

With this PR the behavior will be consistent as we'll be sending both through lexsort_indexer. I've removed the expectation for a RunTimeWarning from the test. Let me know if that doesn't make sense.

lukemanley avatar Sep 14 '22 01:09 lukemanley

@phofl - gentle ping. I think this is ready.

lukemanley avatar Oct 06 '22 00:10 lukemanley

thx. @lukemanley

phofl avatar Oct 07 '22 11:10 phofl