BUG: unstack with sort=False fails when used with the level parameter…
When sort=False, the previous implementation of unstack assumed that the data would appear in the new DataFrame in the same order as in the old one, but this is not always true. Moreover, the code used the sorted codes to map the NaN values, which produced wrong results when sort=False.
To fix the first problem, I assign a 'code id' to each label in a way that simulates the labels already being sorted; the indexer built from these codes then maps the old data into the new frame correctly. The second issue is solved simply by using the old labels.
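For context, a hypothetical reproduction of the kind of case this addresses (constructed for illustration, not copied from the linked issues) is a MultiIndex whose unstacked level is not already sorted:

```python
import pandas as pd

# MultiIndex whose second level appears in non-sorted order ("y" before "x").
index = pd.MultiIndex.from_tuples(
    [("a", "y"), ("a", "x"), ("b", "y"), ("b", "x")]
)
ser = pd.Series([1, 2, 3, 4], index=index)

# With sort=False the unstacked columns should follow appearance order
# ("y", "x"); before this fix the values could be placed as if the labels
# had been sorted, misaligning data and columns.
print(ser.unstack(level=-1, sort=False))
```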
- [x] closes #54987
- [x] closes #55516
- [x] Tests added and passed if fixing a bug or adding a new feature
- [x] All code checks passed.
- [x] Added type annotations to new arguments/methods/functions.
- [ ] Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.
Thanks for the PR @renanffernando. I only took a quick look; I think there are perhaps more performant ways of carrying out the changes you have here. Running the ASVs as-is but adding sort=False to the unstack benchmarks, I'm seeing this:
before after ratio
[14cf864c] [3b175962]
<main> <franco-unstack>
+ 913±2μs 10.9±0.02ms 11.93 reshape.ReshapeExtensionDtype.time_unstack_fast('Period[s]')
+ 932±9μs 10.9±0.02ms 11.72 reshape.ReshapeExtensionDtype.time_unstack_fast('datetime64[ns, US/Pacific]')
+ 1.71±0.01ms 11.7±0.05ms 6.87 reshape.ReshapeExtensionDtype.time_unstack_slow('datetime64[ns, US/Pacific]')
+ 1.70±0ms 11.7±0.05ms 6.85 reshape.ReshapeExtensionDtype.time_unstack_slow('Period[s]')
+ 1.47±0.01ms 10.0±0.02ms 6.79 reshape.SimpleReshape.time_unstack
+ 2.04±0.01ms 12.2±0.1ms 5.98 reshape.ReshapeMaskedArrayDtype.time_unstack_slow('Int64')
+ 2.03±0.01ms 12.1±0.08ms 5.94 reshape.ReshapeMaskedArrayDtype.time_unstack_slow('Float64')
+ 5.22±0.05ms 15.2±0.04ms 2.92 reshape.ReshapeMaskedArrayDtype.time_unstack_fast('Float64')
+ 5.21±0.06ms 15.2±0.06ms 2.91 reshape.ReshapeMaskedArrayDtype.time_unstack_fast('Int64')
+ 23.3±0.6ms 44.5±0.1ms 1.91 reshape.Unstack.time_full_product('int')
+ 925±2μs 1.69±0.01ms 1.82 reshape.SparseIndex.time_unstack
+ 16.6±0.3ms 25.4±0.3ms 1.53 reshape.Unstack.time_full_product('category')
+ 17.7±0.2ms 26.9±0.3ms 1.52 reshape.Unstack.time_without_last_row('category')
+ 52.5±0.3ms 74.7±0.2ms 1.42 reshape.Unstack.time_without_last_row('int')
SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE DECREASED.
Perhaps some of the performance can be clawed back - I'll be taking a more detailed look in the next few days.
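For reference, "adding sort=False to the unstack benchmarks" could look like the following sketch, loosely modelled on the ASV benchmarks in asv_bench/benchmarks/reshape.py; the class and setup below are illustrative assumptions, not the actual benchmark code:

```python
import numpy as np
import pandas as pd


class SimpleReshapeSortFalse:
    # Hypothetical ASV-style benchmark exercising unstack(sort=False).
    def setup(self):
        arrays = [np.repeat(np.arange(100), 100), np.tile(np.arange(100), 100)]
        index = pd.MultiIndex.from_arrays(arrays, names=["major", "minor"])
        self.df = pd.DataFrame({"value": np.random.randn(100 * 100)}, index=index)

    def time_unstack(self):
        self.df.unstack(sort=False)
```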
I did not know about these performance tests, @rhshadrach.
I did some experiments, and the cause of the poor performance is the creation of the new labels' codes. I tested some alternatives, but none of them performed well. Anyway, I updated the mapping to the best version I found.
Do you know a better alternative to do this mapping?
> Do you know a better alternative to do this mapping?
It sounds like you're looking for factorize with sort=False: https://pandas.pydata.org/docs/reference/api/pandas.factorize.html
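For illustration, a minimal sketch of the suggestion: factorize with sort=False assigns codes in order of first appearance, which is exactly the ordering that unstack(sort=False) needs for the mapping.

```python
import numpy as np
import pandas as pd

labels = np.array(["y", "x", "y", "x"], dtype=object)

# With sort=False, codes follow the order of first appearance.
codes, uniques = pd.factorize(labels, sort=False)
print(codes)    # [0 1 0 1]
print(uniques)  # ['y' 'x']

# With sort=True, codes are relative to the sorted uniques instead.
codes_sorted, uniques_sorted = pd.factorize(labels, sort=True)
print(codes_sorted)    # [1 0 1 0]
print(uniques_sorted)  # ['x' 'y']
```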
I changed the code to use the factorize function as you recommended, @rhshadrach, and the performance regression is much smaller now. However, some benchmarks still show roughly a 2x slowdown, as shown below. Do you think that is acceptable?
| Change | Before [46c8da3e] <v2.3.0.dev0~122> | After [3bd4fdf3] | Ratio | Benchmark (Parameter) |
|---|---|---|---|---|
| + | 26.9±0.1ms | 57.9±3ms | 2.16 | reshape.Unstack.time_full_product('int') |
| + | 2.35±0.01ms | 4.61±0.9ms | 1.96 | reshape.SimpleReshape.time_stack |
| + | 877±9μs | 1.58±0.2ms | 1.8 | reshape.SparseIndex.time_unstack |
| + | 65.0±1ms | 108±10ms | 1.67 | reshape.Unstack.time_without_last_row('int') |
| + | 1.42±0.02ms | 2.18±0.4ms | 1.53 | reshape.SimpleReshape.time_unstack |
| + | 1.96±0.01ms | 2.96±0.5ms | 1.51 | reshape.ReshapeMaskedArrayDtype.time_unstack_slow('Float64') |
| + | 1.97±0.03ms | 2.87±0.4ms | 1.45 | reshape.ReshapeMaskedArrayDtype.time_unstack_slow('Int64') |
| + | 874±4μs | 1.23±0.05ms | 1.41 | reshape.ReshapeExtensionDtype.time_unstack_fast('datetime64[ns, US/Pacific]') |
| + | 866±6μs | 1.15±0.03ms | 1.33 | reshape.ReshapeExtensionDtype.time_unstack_fast('Period[s]') |
| + | 5.16±0.1ms | 6.71±2ms | 1.3 | reshape.ReshapeMaskedArrayDtype.time_unstack_fast('Int64') |
| + | 7.95±0.07ms | 10.2±4ms | 1.29 | reshape.ReshapeMaskedArrayDtype.time_stack('Int64') |
| + | 1.71±0.09ms | 2.18±0.2ms | 1.27 | reshape.ReshapeExtensionDtype.time_unstack_slow('Period[s]') |
| + | 3.33±0.02ms | 3.95±0.3ms | 1.19 | reshape.Cut.time_cut_datetime(4) |
| + | 16.5±0.3ms | 19.2±0.4ms | 1.17 | reshape.Unstack.time_full_product('category') |
| + | 7.94±0.06ms | 9.21±0.6ms | 1.16 | reshape.ReshapeMaskedArrayDtype.time_stack('Float64') |
| + | 40.1±0.06ms | 45.8±2ms | 1.14 | reshape.Crosstab.time_crosstab_normalize_margins |
| + | 4.17±0.02ms | 4.74±0.2ms | 1.14 | reshape.Cut.time_cut_datetime(10) |
| + | 105±2ms | 120±6ms | 1.14 | reshape.Pivot.time_reshape_pivot_time_series |
| + | 31.1±0.2ms | 34.9±2ms | 1.12 | reshape.PivotTable.time_pivot_table_margins_only_column |
| + | 3.78±0.02ms | 4.25±0.2ms | 1.12 | reshape.ReshapeExtensionDtype.time_stack('Period[s]') |
| + | 5.21±0.1ms | 5.79±0.5ms | 1.11 | reshape.ReshapeMaskedArrayDtype.time_unstack_fast('Float64') |
| + | 27.7±0.2ms | 30.6±0.8ms | 1.1 | reshape.PivotTable.time_pivot_table_agg |
This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.
@renanffernando - are you interested in continuing here? If not, I plan to finish this up.
With the latest commit, I'm seeing:
`asv continuous -f 1.1 -E virtualenv upstream/main HEAD -b "^reshape"`
BENCHMARKS NOT SIGNIFICANTLY CHANGED.
@mroeschke - good to merge?
One comment about the test: the tests should stay; just removing the assertion on shares memory is the way to go.
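As a hypothetical sketch of that suggestion (not the actual test from this PR): keep the behavioural round-trip check and drop only the memory-sharing assertion.

```python
import pandas as pd
import pandas._testing as tm

index = pd.MultiIndex.from_tuples([("a", "y"), ("a", "x")])
ser = pd.Series([1, 2], index=index)

result = ser.unstack(sort=False)
expected = pd.DataFrame([[1, 2]], index=["a"], columns=["y", "x"])

# The equality check stays...
tm.assert_frame_equal(result, expected)
# ...while an assertion like the following would be removed:
# assert np.shares_memory(result.values, ser.values)
```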
Thanks @renanffernando and @rhshadrach
Thanks for fixing. Is it possible to backport this fix to 2.2.x?
> Is it possible to backport this fix to 2.2.x?
Backports are reserved for bugs introduced in recent prior versions, and from the issues it doesn't appear that this is the case.