PR Description

Please describe the changes proposed in the pull request:

Improve performance when sort_by_appearance is True
Added pivot_longer_spec, which allows unpivoting by hand - this allows more granular control on how the final dataframe should look in long form.
general refactoring

import pandas as pd; import janitor as jn

In [11]: events = pd.DataFrame(
    ...:             {
    ...:                 "country": ["United States", "Russia", "China"],
    ...:                 "vault_2012_f": [
    ...:                     48.132,
    ...:                     46.366,
    ...:                     44.266,
    ...:                 ],
    ...:                 "vault_2012_m": [46.632, 46.866, 48.316],
    ...:                 "vault_2016_f": [
    ...:                     46.866,
    ...:                     45.733,
    ...:                     44.332,
    ...:                 ],
    ...:                 "vault_2016_m": [45.865, 46.033, 45.0],
    ...:                 "floor_2012_f": [45.366, 41.599, 40.833],
    ...:                 "floor_2012_m": [45.266, 45.308, 45.133],
    ...:                 "floor_2016_f": [45.999, 42.032, 42.066],
    ...:                 "floor_2016_m": [43.757, 44.766, 43.799],
    ...:             }
    ...:         )

In [12]: events
Out[12]:
         country  vault_2012_f  vault_2012_m  ...  floor_2012_m  floor_2016_f  floor_2016_m
0  United States        48.132        46.632  ...        45.266        45.999        43.757
1         Russia        46.366        46.866  ...        45.308        42.032        44.766
2          China        44.266        48.316  ...        45.133        42.066        43.799

[3 rows x 9 columns]

events = pd.concat([events]*100_000)

# dev
In [848]: %timeit events.pivot_longer(index='country', names_to=['event','year','gender'], names_sep='_',sort_by_appearance=False)
62.9 ms ± 361 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [849]: %timeit events.pivot_longer(index='country', names_to=['event','year','gender'], names_sep='_',sort_by_appearance=True)
165 ms ± 1.01 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


# PR
In [842]: %timeit events.pivot_longer(index='country', names_to=['event','year','gender'], names_sep='_',sort_by_appearance=False)
53.2 ms ± 264 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [843]: %timeit events.pivot_longer(index='country', names_to=['event','year','gender'], names_sep='_',sort_by_appearance=True)
48 ms ± 486 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Performance test for lots of columns (YMMV):

events = pd.DataFrame(
     ...:             {
     ...:                 "country": ["United States", "Russia", "China"],
     ...:                 "vault_2012_f": [
     ...:                     48.132,
     ...:                     46.366,
     ...:                     44.266,
     ...:                 ],
     ...:                 "vault_2012_m": [46.632, 46.866, 48.316],
     ...:                 "vault_2016_f": [
     ...:                     46.866,
     ...:                     45.733,
     ...:                     44.332,
     ...:                 ],
     ...:                 "vault_2016_m": [45.865, 46.033, 45.0],
     ...:                 "floor_2012_f": [45.366, 41.599, 40.833],
     ...:                 "floor_2012_m": [45.266, 45.308, 45.133],
     ...:                 "floor_2016_f": [45.999, 42.032, 42.066],
     ...:                 "floor_2016_m": [43.757, 44.766, 43.799],
     ...:             }
     ...:         )

events
         country  vault_2012_f  vault_2012_m  vault_2016_f  vault_2016_m  floor_2012_f  floor_2012_m  floor_2016_f  floor_2016_m
0  United States        48.132        46.632        46.866        45.865        45.366        45.266        45.999        43.757
1         Russia        46.366        46.866        45.733        46.033        41.599        45.308        42.032        44.766
2          China        44.266        48.316        44.332        45.000        40.833        45.133        42.066        43.799

events = events.set_index('country')
events = pd.concat([events.add_suffix(f'_{num}') for num in range(100)],axis=1)
events = pd.concat([events]*10_000)
events = events.reset_index()
In [143]: events.shape
Out[143]: (30000, 801)

# dev 
In [147]: %timeit events.pivot_longer('country', names_to=['event','year','gender','num'],names_sep='_',values_to='score', names_transform={'year':int}, sort_by_appearance=True)
2.85 s ± 34.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [148]: %timeit events.pivot_longer('country', names_to=['event','year','gender','num'],names_sep='_',values_to='score', names_transform={'year':int}, sort_by_appearance=False)
687 ms ± 7.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


# this PR
 In [13]: %timeit events.pivot_longer('country', names_to=['event','year','gender','num'],names_sep='_',values_to='score', names_transform={'
    ...: year':int}, sort_by_appearance=True)
420 ms ± 3.15 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [14]: %timeit events.pivot_longer('country', names_to=['event','year','gender','num'],names_sep='_',values_to='score', names_transform={'
    ...: year':int}, sort_by_appearance=False)
470 ms ± 2.52 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

This PR resolves #1361 .

PR Checklist

Please ensure that you have done the following:

[x] PR in from a fork off your branch. Do not PR from <your_username>:dev, but rather from <your_username>:<feature-branch_name>.

[x] If you're not on the contributors list, add yourself to AUTHORS.md.

[x] Add a line to CHANGELOG.md under the latest version header (i.e. the one that is "on deck") describing the contribution.
- Do use some discretion here; if there are multiple PRs that are related, keep them in a single line.

Automatic checks

There will be automatic checks run on the PR. These include:

Building a preview of the docs on Netlify
Automatically linting the code
Making sure the code is documented
Making sure that all tests are passed
Making sure that code coverage doesn't go down.

Relevant Reviewers

Please tag maintainers to review.

@ericmjl

May 10 '24 02:05 samukweku

🚀 Deployed on https://deploy-preview-1362--pyjanitor.netlify.app

May 10 '24 02:05 ericmjl

Ok, I just had a chance to look through the PR. Super high quality work! There was one file that was a tad too long where the implementation happened; I'm going to trust that it works fine. Otherwise, thank you for keeping the code test coverage high, @samukweku!

Jul 03 '24 00:07 ericmjl

I am going to approve. Please do the honors of merging!

Jul 03 '24 00:07 ericmjl

@ericmjl thanks for the feedback... I have to figure out how to break up such PRs into small chunks

Jul 03 '24 06:07 samukweku

pyjanitor
pyjanitor copied to clipboard

[ENH] `pivot_longer_spec`

PR Description

PR Checklist

Automatic checks

Relevant Reviewers

pyjanitor pyjanitor copied to clipboard

[ENH] `pivot_longer_spec`

PR Description

PR Checklist

Automatic checks

Relevant Reviewers

pyjanitor
pyjanitor copied to clipboard