pandas icon indicating copy to clipboard operation
pandas copied to clipboard

BUG: unstack with sort=False fails when used with the level parameter…

Open renanffernando opened this issue 1 year ago • 9 comments
trafficstars

When sort = False, the previous implementation of unstack assumes that the data will appear in the new data frame in the same order as the old data, but this is not always true. Moreover, the code uses the sorted labels to map the nan values, but it led to a wrong result when sort = False.

To fix the first problem, I assign a 'code id' for each label in a way to simulate that they are already sorted. In this way, the indexer created will map the old data in the new one correctly. The second issue is solved simply by using the old labels.

  • [x] closes #54987 (Replace xxxx with the GitHub issue number)
  • [x] closes #55516
  • [x] Tests added and passed if fixing a bug or adding a new feature
  • [x] All code checks passed.
  • [x] Added type annotations to new arguments/methods/functions.
  • [ ] Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

renanffernando avatar Dec 06 '23 05:12 renanffernando

Thanks for the PR @renanffernando. I only took a quick look, I think there are perhaps more performant ways of carrying out the changes you have here. Running the ASVs as-is but adding sort=False to the unstack benchmarks, I'm seeing this:

       before           after         ratio
     [14cf864c]       [3b175962]
       <main>        <franco-unstack> 
+         913±2μs      10.9±0.02ms    11.93  reshape.ReshapeExtensionDtype.time_unstack_fast('Period[s]')
+         932±9μs      10.9±0.02ms    11.72  reshape.ReshapeExtensionDtype.time_unstack_fast('datetime64[ns, US/Pacific]')
+     1.71±0.01ms      11.7±0.05ms     6.87  reshape.ReshapeExtensionDtype.time_unstack_slow('datetime64[ns, US/Pacific]')
+        1.70±0ms      11.7±0.05ms     6.85  reshape.ReshapeExtensionDtype.time_unstack_slow('Period[s]')
+     1.47±0.01ms      10.0±0.02ms     6.79  reshape.SimpleReshape.time_unstack
+     2.04±0.01ms       12.2±0.1ms     5.98  reshape.ReshapeMaskedArrayDtype.time_unstack_slow('Int64')
+     2.03±0.01ms      12.1±0.08ms     5.94  reshape.ReshapeMaskedArrayDtype.time_unstack_slow('Float64')
+     5.22±0.05ms      15.2±0.04ms     2.92  reshape.ReshapeMaskedArrayDtype.time_unstack_fast('Float64')
+     5.21±0.06ms      15.2±0.06ms     2.91  reshape.ReshapeMaskedArrayDtype.time_unstack_fast('Int64')
+      23.3±0.6ms       44.5±0.1ms     1.91  reshape.Unstack.time_full_product('int')
+         925±2μs      1.69±0.01ms     1.82  reshape.SparseIndex.time_unstack
+      16.6±0.3ms       25.4±0.3ms     1.53  reshape.Unstack.time_full_product('category')
+      17.7±0.2ms       26.9±0.3ms     1.52  reshape.Unstack.time_without_last_row('category')
+      52.5±0.3ms       74.7±0.2ms     1.42  reshape.Unstack.time_without_last_row('int')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE DECREASED.

Perhaps some of the performance can be clawed back - I'll be taking a more detailed look in the next few days.

rhshadrach avatar Dec 18 '23 22:12 rhshadrach

I did not know about these performance tests @rhshadrach.

I did some experiments and the cause of this poor performance is the creation of the new label's codes. However, I tested some alternatives, but no one performed well. Anyway, I updated the mapping to the best version I found.

Do you know a better alternative to do this mapping?

renanffernando avatar Dec 21 '23 03:12 renanffernando

Do you know a better alternative to do this mapping?

It sounds like you're looking for factorize with sort=False: https://pandas.pydata.org/docs/reference/api/pandas.factorize.html

rhshadrach avatar Jan 02 '24 22:01 rhshadrach

I changed the code to use the factorize function as you recommended @rhshadrach, and the performance decrease is much better now. However, it still has a performance decrease by a factor of two in some tests, as I present below. Do you think that is ok?

Change Before [46c8da3e] <v2.3.0.dev0~122> After [3bd4fdf3] Ratio Benchmark (Parameter)
+ 26.9±0.1ms 57.9±3ms 2.16 reshape.Unstack.time_full_product('int')
+ 2.35±0.01ms 4.61±0.9ms 1.96 reshape.SimpleReshape.time_stack
+ 877±9μs 1.58±0.2ms 1.8 reshape.SparseIndex.time_unstack
+ 65.0±1ms 108±10ms 1.67 reshape.Unstack.time_without_last_row('int')
+ 1.42±0.02ms 2.18±0.4ms 1.53 reshape.SimpleReshape.time_unstack
+ 1.96±0.01ms 2.96±0.5ms 1.51 reshape.ReshapeMaskedArrayDtype.time_unstack_slow('Float64')
+ 1.97±0.03ms 2.87±0.4ms 1.45 reshape.ReshapeMaskedArrayDtype.time_unstack_slow('Int64')
+ 874±4μs 1.23±0.05ms 1.41 reshape.ReshapeExtensionDtype.time_unstack_fast('datetime64[ns, US/Pacific]')
+ 866±6μs 1.15±0.03ms 1.33 reshape.ReshapeExtensionDtype.time_unstack_fast('Period[s]')
+ 5.16±0.1ms 6.71±2ms 1.3 reshape.ReshapeMaskedArrayDtype.time_unstack_fast('Int64')
+ 7.95±0.07ms 10.2±4ms 1.29 reshape.ReshapeMaskedArrayDtype.time_stack('Int64')
+ 1.71±0.09ms 2.18±0.2ms 1.27 reshape.ReshapeExtensionDtype.time_unstack_slow('Period[s]')
+ 3.33±0.02ms 3.95±0.3ms 1.19 reshape.Cut.time_cut_datetime(4)
+ 16.5±0.3ms 19.2±0.4ms 1.17 reshape.Unstack.time_full_product('category')
+ 7.94±0.06ms 9.21±0.6ms 1.16 reshape.ReshapeMaskedArrayDtype.time_stack('Float64')
+ 40.1±0.06ms 45.8±2ms 1.14 reshape.Crosstab.time_crosstab_normalize_margins
+ 4.17±0.02ms 4.74±0.2ms 1.14 reshape.Cut.time_cut_datetime(10)
+ 105±2ms 120±6ms 1.14 reshape.Pivot.time_reshape_pivot_time_series
+ 31.1±0.2ms 34.9±2ms 1.12 reshape.PivotTable.time_pivot_table_margins_only_column
+ 3.78±0.02ms 4.25±0.2ms 1.12 reshape.ReshapeExtensionDtype.time_stack('Period[s]')
+ 5.21±0.1ms 5.79±0.5ms 1.11 reshape.ReshapeMaskedArrayDtype.time_unstack_fast('Float64')
+ 27.7±0.2ms 30.6±0.8ms 1.1 reshape.PivotTable.time_pivot_table_agg

renanffernando avatar Jan 24 '24 18:01 renanffernando

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

github-actions[bot] avatar Feb 28 '24 00:02 github-actions[bot]

@renanffernando - are you interested in continuing here? If not, I plan to finish this up.

rhshadrach avatar Feb 28 '24 22:02 rhshadrach

With the latest commit, I'm seeing:

asv continuous -f 1.1 -E virtualenv upstream/main HEAD -b "^reshape"
BENCHMARKS NOT SIGNIFICANTLY CHANGED.

rhshadrach avatar Mar 31 '24 13:03 rhshadrach

@mroeschke - good to merge?

rhshadrach avatar May 16 '24 20:05 rhshadrach

One comment about the test, the tests should stay, just removing the assertion on shares memory is the way to go

phofl avatar May 16 '24 20:05 phofl

Thanks @renanffernando and @rhshadrach

mroeschke avatar May 21 '24 16:05 mroeschke

Thanks for fixing. Is it possible to backport this fix to 2.2.x?

bluss avatar Jun 18 '24 05:06 bluss

Is it possible to backport this fix to 2.2.x?

Backports are reserved for bugs introduce in recent prior versions, and from the issues it doesn't appear that this is the case.

mroeschke avatar Jun 18 '24 17:06 mroeschke