BUG: unstack with sort=False fails when used with the level parameter…
When sort=False, the previous implementation of unstack assumed that the data would appear in the new DataFrame in the same order as in the old one, but this is not always true. Moreover, the code used the sorted codes to map the NaN values, which produced wrong results when sort=False.
To fix the first problem, I assign a 'code id' to each label in a way that simulates the labels already being sorted; the indexer built from these codes then maps the old data into the new frame correctly. The second issue is solved simply by using the old labels.
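For context, a hypothetical reproduction of the kind of case this addresses (constructed for illustration, not copied from the linked issues) is a MultiIndex whose unstacked level is not already sorted:

```python
import pandas as pd

# MultiIndex whose second level appears in non-sorted order ("y" before "x").
index = pd.MultiIndex.from_tuples(
    [("a", "y"), ("a", "x"), ("b", "y"), ("b", "x")]
)
ser = pd.Series([1, 2, 3, 4], index=index)

# With sort=False the unstacked columns should follow appearance order
# ("y", "x"); before this fix the values could be placed as if the labels
# had been sorted, misaligning data and columns.
print(ser.unstack(level=-1, sort=False))
```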
- [x] closes #54987
- [x] closes #55516
- [x] Tests added and passed if fixing a bug or adding a new feature
- [x] All code checks passed.
- [x] Added type annotations to new arguments/methods/functions.
- [ ] Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.
Thanks for the PR @renanffernando. I only took a quick look; I think there are perhaps more performant ways of carrying out the changes you have here. Running the ASVs as-is but adding sort=False to the unstack benchmarks, I'm seeing this:
before after ratio
[14cf864c] [3b175962]
<main> <franco-unstack>
+ 913±2μs 10.9±0.02ms 11.93 reshape.ReshapeExtensionDtype.time_unstack_fast('Period[s]')
+ 932±9μs 10.9±0.02ms 11.72 reshape.ReshapeExtensionDtype.time_unstack_fast('datetime64[ns, US/Pacific]')
+ 1.71±0.01ms 11.7±0.05ms 6.87 reshape.ReshapeExtensionDtype.time_unstack_slow('datetime64[ns, US/Pacific]')
+ 1.70±0ms 11.7±0.05ms 6.85 reshape.ReshapeExtensionDtype.time_unstack_slow('Period[s]')
+ 1.47±0.01ms 10.0±0.02ms 6.79 reshape.SimpleReshape.time_unstack
+ 2.04±0.01ms 12.2±0.1ms 5.98 reshape.ReshapeMaskedArrayDtype.time_unstack_slow('Int64')
+ 2.03±0.01ms 12.1±0.08ms 5.94 reshape.ReshapeMaskedArrayDtype.time_unstack_slow('Float64')
+ 5.22±0.05ms 15.2±0.04ms 2.92 reshape.ReshapeMaskedArrayDtype.time_unstack_fast('Float64')
+ 5.21±0.06ms 15.2±0.06ms 2.91 reshape.ReshapeMaskedArrayDtype.time_unstack_fast('Int64')
+ 23.3±0.6ms 44.5±0.1ms 1.91 reshape.Unstack.time_full_product('int')
+ 925±2μs 1.69±0.01ms 1.82 reshape.SparseIndex.time_unstack
+ 16.6±0.3ms 25.4±0.3ms 1.53 reshape.Unstack.time_full_product('category')
+ 17.7±0.2ms 26.9±0.3ms 1.52 reshape.Unstack.time_without_last_row('category')
+ 52.5±0.3ms 74.7±0.2ms 1.42 reshape.Unstack.time_without_last_row('int')
SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE DECREASED.
Perhaps some of the performance can be clawed back - I'll be taking a more detailed look in the next few days.
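For reference, "adding sort=False to the unstack benchmarks" could look like the following sketch, loosely modelled on the ASV benchmarks in asv_bench/benchmarks/reshape.py; the class and setup below are illustrative assumptions, not the actual benchmark code:

```python
import numpy as np
import pandas as pd


class SimpleReshapeSortFalse:
    # Hypothetical ASV-style benchmark exercising unstack(sort=False).
    def setup(self):
        arrays = [np.repeat(np.arange(100), 100), np.tile(np.arange(100), 100)]
        index = pd.MultiIndex.from_arrays(arrays, names=["major", "minor"])
        self.df = pd.DataFrame({"value": np.random.randn(100 * 100)}, index=index)

    def time_unstack(self):
        self.df.unstack(sort=False)
```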
I did not know about these performance tests, @rhshadrach.
I did some experiments, and the cause of the poor performance is the creation of the new labels' codes. I tested some alternatives, but none of them performed well. Anyway, I updated the mapping to the best version I found.
Do you know a better alternative to do this mapping?
> Do you know a better alternative to do this mapping?
It sounds like you're looking for factorize with sort=False: https://pandas.pydata.org/docs/reference/api/pandas.factorize.html
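For illustration, a minimal sketch of the suggestion: factorize with sort=False assigns codes in order of first appearance, which is exactly the ordering that unstack(sort=False) needs for the mapping.

```python
import numpy as np
import pandas as pd

labels = np.array(["y", "x", "y", "x"], dtype=object)

# With sort=False, codes follow the order of first appearance.
codes, uniques = pd.factorize(labels, sort=False)
print(codes)    # [0 1 0 1]
print(uniques)  # ['y' 'x']

# With sort=True, codes are relative to the sorted uniques instead.
codes_sorted, uniques_sorted = pd.factorize(labels, sort=True)
print(codes_sorted)    # [1 0 1 0]
print(uniques_sorted)  # ['x' 'y']
```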
I changed the code to use the factorize function as you recommended, @rhshadrach, and the performance regression is much smaller now. However, some benchmarks still show roughly a 2x slowdown, as shown below. Do you think that is acceptable?
| Change | Before [46c8da3e] <v2.3.0.dev0~122> | After [3bd4fdf3] | Ratio | Benchmark (Parameter) |
|---|---|---|---|---|
| + | 26.9±0.1ms | 57.9±3ms | 2.16 | reshape.Unstack.time_full_product('int') |
| + | 2.35±0.01ms | 4.61±0.9ms | 1.96 | reshape.SimpleReshape.time_stack |
| + | 877±9μs | 1.58±0.2ms | 1.8 | reshape.SparseIndex.time_unstack |
| + | 65.0±1ms | 108±10ms | 1.67 | reshape.Unstack.time_without_last_row('int') |
| + | 1.42±0.02ms | 2.18±0.4ms | 1.53 | reshape.SimpleReshape.time_unstack |
| + | 1.96±0.01ms | 2.96±0.5ms | 1.51 | reshape.ReshapeMaskedArrayDtype.time_unstack_slow('Float64') |
| + | 1.97±0.03ms | 2.87±0.4ms | 1.45 | reshape.ReshapeMaskedArrayDtype.time_unstack_slow('Int64') |
| + | 874±4μs | 1.23±0.05ms | 1.41 | reshape.ReshapeExtensionDtype.time_unstack_fast('datetime64[ns, US/Pacific]') |
| + | 866±6μs | 1.15±0.03ms | 1.33 | reshape.ReshapeExtensionDtype.time_unstack_fast('Period[s]') |
| + | 5.16±0.1ms | 6.71±2ms | 1.3 | reshape.ReshapeMaskedArrayDtype.time_unstack_fast('Int64') |
| + | 7.95±0.07ms | 10.2±4ms | 1.29 | reshape.ReshapeMaskedArrayDtype.time_stack('Int64') |
| + | 1.71±0.09ms | 2.18±0.2ms | 1.27 | reshape.ReshapeExtensionDtype.time_unstack_slow('Period[s]') |
| + | 3.33±0.02ms | 3.95±0.3ms | 1.19 | reshape.Cut.time_cut_datetime(4) |
| + | 16.5±0.3ms | 19.2±0.4ms | 1.17 | reshape.Unstack.time_full_product('category') |
| + | 7.94±0.06ms | 9.21±0.6ms | 1.16 | reshape.ReshapeMaskedArrayDtype.time_stack('Float64') |
| + | 40.1±0.06ms | 45.8±2ms | 1.14 | reshape.Crosstab.time_crosstab_normalize_margins |
| + | 4.17±0.02ms | 4.74±0.2ms | 1.14 | reshape.Cut.time_cut_datetime(10) |
| + | 105±2ms | 120±6ms | 1.14 | reshape.Pivot.time_reshape_pivot_time_series |
| + | 31.1±0.2ms | 34.9±2ms | 1.12 | reshape.PivotTable.time_pivot_table_margins_only_column |
| + | 3.78±0.02ms | 4.25±0.2ms | 1.12 | reshape.ReshapeExtensionDtype.time_stack('Period[s]') |
| + | 5.21±0.1ms | 5.79±0.5ms | 1.11 | reshape.ReshapeMaskedArrayDtype.time_unstack_fast('Float64') |
| + | 27.7±0.2ms | 30.6±0.8ms | 1.1 | reshape.PivotTable.time_pivot_table_agg |
This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.
@renanffernando - are you interested in continuing here? If not, I plan to finish this up.
With the latest commit, I'm seeing:
`asv continuous -f 1.1 -E virtualenv upstream/main HEAD -b "^reshape"`
BENCHMARKS NOT SIGNIFICANTLY CHANGED.
@mroeschke - good to merge?
One comment about the test: the tests should stay; just removing the assertion on shares memory is the way to go.
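As a hypothetical sketch of that suggestion (not the actual test from this PR): keep the behavioural round-trip check and drop only the memory-sharing assertion.

```python
import pandas as pd
import pandas._testing as tm

index = pd.MultiIndex.from_tuples([("a", "y"), ("a", "x")])
ser = pd.Series([1, 2], index=index)

result = ser.unstack(sort=False)
expected = pd.DataFrame([[1, 2]], index=["a"], columns=["y", "x"])

# The equality check stays...
tm.assert_frame_equal(result, expected)
# ...while an assertion like the following would be removed:
# assert np.shares_memory(result.values, ser.values)
```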
Thanks @renanffernando and @rhshadrach
Thanks for fixing. Is it possible to backport this fix to 2.2.x?
> Is it possible to backport this fix to 2.2.x?
Backports are reserved for bugs introduced in recent prior versions, and from the issues it doesn't appear that this is the case.