pyjanitor
pyjanitor copied to clipboard
[ENH] `pivot_longer_spec`
PR Description
Please describe the changes proposed in the pull request:
- Improve performance when
sort_by_appearanceis True - Added
pivot_longer_spec, which allows unpivoting by hand - this allows more granular control on how the final dataframe should look in long form. - general refactoring
import pandas as pd; import janitor as jn
In [11]: events = pd.DataFrame(
...: {
...: "country": ["United States", "Russia", "China"],
...: "vault_2012_f": [
...: 48.132,
...: 46.366,
...: 44.266,
...: ],
...: "vault_2012_m": [46.632, 46.866, 48.316],
...: "vault_2016_f": [
...: 46.866,
...: 45.733,
...: 44.332,
...: ],
...: "vault_2016_m": [45.865, 46.033, 45.0],
...: "floor_2012_f": [45.366, 41.599, 40.833],
...: "floor_2012_m": [45.266, 45.308, 45.133],
...: "floor_2016_f": [45.999, 42.032, 42.066],
...: "floor_2016_m": [43.757, 44.766, 43.799],
...: }
...: )
In [12]: events
Out[12]:
country vault_2012_f vault_2012_m ... floor_2012_m floor_2016_f floor_2016_m
0 United States 48.132 46.632 ... 45.266 45.999 43.757
1 Russia 46.366 46.866 ... 45.308 42.032 44.766
2 China 44.266 48.316 ... 45.133 42.066 43.799
[3 rows x 9 columns]
events = pd.concat([events]*100_000)
# dev
In [848]: %timeit events.pivot_longer(index='country', names_to=['event','year','gender'], names_sep='_',sort_by_appearance=False)
62.9 ms ± 361 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [849]: %timeit events.pivot_longer(index='country', names_to=['event','year','gender'], names_sep='_',sort_by_appearance=True)
165 ms ± 1.01 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# PR
In [842]: %timeit events.pivot_longer(index='country', names_to=['event','year','gender'], names_sep='_',sort_by_appearance=False)
53.2 ms ± 264 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [843]: %timeit events.pivot_longer(index='country', names_to=['event','year','gender'], names_sep='_',sort_by_appearance=True)
48 ms ± 486 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Performance test for lots of columns (YMMV):
events = pd.DataFrame(
...: {
...: "country": ["United States", "Russia", "China"],
...: "vault_2012_f": [
...: 48.132,
...: 46.366,
...: 44.266,
...: ],
...: "vault_2012_m": [46.632, 46.866, 48.316],
...: "vault_2016_f": [
...: 46.866,
...: 45.733,
...: 44.332,
...: ],
...: "vault_2016_m": [45.865, 46.033, 45.0],
...: "floor_2012_f": [45.366, 41.599, 40.833],
...: "floor_2012_m": [45.266, 45.308, 45.133],
...: "floor_2016_f": [45.999, 42.032, 42.066],
...: "floor_2016_m": [43.757, 44.766, 43.799],
...: }
...: )
events
country vault_2012_f vault_2012_m vault_2016_f vault_2016_m floor_2012_f floor_2012_m floor_2016_f floor_2016_m
0 United States 48.132 46.632 46.866 45.865 45.366 45.266 45.999 43.757
1 Russia 46.366 46.866 45.733 46.033 41.599 45.308 42.032 44.766
2 China 44.266 48.316 44.332 45.000 40.833 45.133 42.066 43.799
events = events.set_index('country')
events = pd.concat([events.add_suffix(f'_{num}') for num in range(100)],axis=1)
events = pd.concat([events]*10_000)
events = events.reset_index()
In [143]: events.shape
Out[143]: (30000, 801)
# dev
In [147]: %timeit events.pivot_longer('country', names_to=['event','year','gender','num'],names_sep='_',values_to='score', names_transform={'year':int}, sort_by_appearance=True)
2.85 s ± 34.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [148]: %timeit events.pivot_longer('country', names_to=['event','year','gender','num'],names_sep='_',values_to='score', names_transform={'year':int}, sort_by_appearance=False)
687 ms ± 7.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# this PR
In [13]: %timeit events.pivot_longer('country', names_to=['event','year','gender','num'],names_sep='_',values_to='score', names_transform={'
...: year':int}, sort_by_appearance=True)
420 ms ± 3.15 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [14]: %timeit events.pivot_longer('country', names_to=['event','year','gender','num'],names_sep='_',values_to='score', names_transform={'
...: year':int}, sort_by_appearance=False)
470 ms ± 2.52 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
This PR resolves #1361 .
PR Checklist
Please ensure that you have done the following:
- [x] PR in from a fork off your branch. Do not PR from
<your_username>:dev, but rather from<your_username>:<feature-branch_name>.
- [x] If you're not on the contributors list, add yourself to
AUTHORS.md.
- [x] Add a line to
CHANGELOG.mdunder the latest version header (i.e. the one that is "on deck") describing the contribution.- Do use some discretion here; if there are multiple PRs that are related, keep them in a single line.
Automatic checks
There will be automatic checks run on the PR. These include:
- Building a preview of the docs on Netlify
- Automatically linting the code
- Making sure the code is documented
- Making sure that all tests are passed
- Making sure that code coverage doesn't go down.
Relevant Reviewers
Please tag maintainers to review.
- @ericmjl
🚀 Deployed on https://deploy-preview-1362--pyjanitor.netlify.app
Ok, I just had a chance to look through the PR. Super high quality work! There was one file that was a tad too long where the implementation happened; I'm going to trust that it works fine. Otherwise, thank you for keeping the code test coverage high, @samukweku!
I am going to approve. Please do the honors of merging!
@ericmjl thanks for the feedback... I have to figure out how to break up such PRs into small chunks