pyjanitor icon indicating copy to clipboard operation
pyjanitor copied to clipboard

[ENH] `pivot_longer_spec`

Open samukweku opened this issue 1 year ago • 1 comments

PR Description

Please describe the changes proposed in the pull request:

  • Improve performance when sort_by_appearance is True
  • Added pivot_longer_spec, which allows unpivoting by hand - this allows more granular control on how the final dataframe should look in long form.
  • general refactoring
import pandas as pd; import janitor as jn

In [11]: events = pd.DataFrame(
    ...:             {
    ...:                 "country": ["United States", "Russia", "China"],
    ...:                 "vault_2012_f": [
    ...:                     48.132,
    ...:                     46.366,
    ...:                     44.266,
    ...:                 ],
    ...:                 "vault_2012_m": [46.632, 46.866, 48.316],
    ...:                 "vault_2016_f": [
    ...:                     46.866,
    ...:                     45.733,
    ...:                     44.332,
    ...:                 ],
    ...:                 "vault_2016_m": [45.865, 46.033, 45.0],
    ...:                 "floor_2012_f": [45.366, 41.599, 40.833],
    ...:                 "floor_2012_m": [45.266, 45.308, 45.133],
    ...:                 "floor_2016_f": [45.999, 42.032, 42.066],
    ...:                 "floor_2016_m": [43.757, 44.766, 43.799],
    ...:             }
    ...:         )

In [12]: events
Out[12]:
         country  vault_2012_f  vault_2012_m  ...  floor_2012_m  floor_2016_f  floor_2016_m
0  United States        48.132        46.632  ...        45.266        45.999        43.757
1         Russia        46.366        46.866  ...        45.308        42.032        44.766
2          China        44.266        48.316  ...        45.133        42.066        43.799

[3 rows x 9 columns]

events = pd.concat([events]*100_000)

# dev
In [848]: %timeit events.pivot_longer(index='country', names_to=['event','year','gender'], names_sep='_',sort_by_appearance=False)
62.9 ms ± 361 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [849]: %timeit events.pivot_longer(index='country', names_to=['event','year','gender'], names_sep='_',sort_by_appearance=True)
165 ms ± 1.01 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


# PR
In [842]: %timeit events.pivot_longer(index='country', names_to=['event','year','gender'], names_sep='_',sort_by_appearance=False)
53.2 ms ± 264 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [843]: %timeit events.pivot_longer(index='country', names_to=['event','year','gender'], names_sep='_',sort_by_appearance=True)
48 ms ± 486 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Performance test for lots of columns (YMMV):

events = pd.DataFrame(
     ...:             {
     ...:                 "country": ["United States", "Russia", "China"],
     ...:                 "vault_2012_f": [
     ...:                     48.132,
     ...:                     46.366,
     ...:                     44.266,
     ...:                 ],
     ...:                 "vault_2012_m": [46.632, 46.866, 48.316],
     ...:                 "vault_2016_f": [
     ...:                     46.866,
     ...:                     45.733,
     ...:                     44.332,
     ...:                 ],
     ...:                 "vault_2016_m": [45.865, 46.033, 45.0],
     ...:                 "floor_2012_f": [45.366, 41.599, 40.833],
     ...:                 "floor_2012_m": [45.266, 45.308, 45.133],
     ...:                 "floor_2016_f": [45.999, 42.032, 42.066],
     ...:                 "floor_2016_m": [43.757, 44.766, 43.799],
     ...:             }
     ...:         )

events
         country  vault_2012_f  vault_2012_m  vault_2016_f  vault_2016_m  floor_2012_f  floor_2012_m  floor_2016_f  floor_2016_m
0  United States        48.132        46.632        46.866        45.865        45.366        45.266        45.999        43.757
1         Russia        46.366        46.866        45.733        46.033        41.599        45.308        42.032        44.766
2          China        44.266        48.316        44.332        45.000        40.833        45.133        42.066        43.799

events = events.set_index('country')
events = pd.concat([events.add_suffix(f'_{num}') for num in range(100)],axis=1)
events = pd.concat([events]*10_000)
events = events.reset_index()
In [143]: events.shape
Out[143]: (30000, 801)

# dev 
In [147]: %timeit events.pivot_longer('country', names_to=['event','year','gender','num'],names_sep='_',values_to='score', names_transform={'year':int}, sort_by_appearance=True)
2.85 s ± 34.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [148]: %timeit events.pivot_longer('country', names_to=['event','year','gender','num'],names_sep='_',values_to='score', names_transform={'year':int}, sort_by_appearance=False)
687 ms ± 7.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


# this PR
 In [13]: %timeit events.pivot_longer('country', names_to=['event','year','gender','num'],names_sep='_',values_to='score', names_transform={'
    ...: year':int}, sort_by_appearance=True)
420 ms ± 3.15 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [14]: %timeit events.pivot_longer('country', names_to=['event','year','gender','num'],names_sep='_',values_to='score', names_transform={'
    ...: year':int}, sort_by_appearance=False)
470 ms ± 2.52 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

This PR resolves #1361 .

PR Checklist

Please ensure that you have done the following:

  1. [x] PR in from a fork off your branch. Do not PR from <your_username>:dev, but rather from <your_username>:<feature-branch_name>.
  1. [x] If you're not on the contributors list, add yourself to AUTHORS.md.
  1. [x] Add a line to CHANGELOG.md under the latest version header (i.e. the one that is "on deck") describing the contribution.
    • Do use some discretion here; if there are multiple PRs that are related, keep them in a single line.

Automatic checks

There will be automatic checks run on the PR. These include:

  • Building a preview of the docs on Netlify
  • Automatically linting the code
  • Making sure the code is documented
  • Making sure that all tests are passed
  • Making sure that code coverage doesn't go down.

Relevant Reviewers

Please tag maintainers to review.

  • @ericmjl

samukweku avatar May 10 '24 02:05 samukweku

🚀 Deployed on https://deploy-preview-1362--pyjanitor.netlify.app

ericmjl avatar May 10 '24 02:05 ericmjl

Ok, I just had a chance to look through the PR. Super high quality work! There was one file that was a tad too long where the implementation happened; I'm going to trust that it works fine. Otherwise, thank you for keeping the code test coverage high, @samukweku!

ericmjl avatar Jul 03 '24 00:07 ericmjl

I am going to approve. Please do the honors of merging!

ericmjl avatar Jul 03 '24 00:07 ericmjl

@ericmjl thanks for the feedback... I have to figure out how to break up such PRs into small chunks

samukweku avatar Jul 03 '24 06:07 samukweku