pyjanitor
[ENH] improve performance for polars' `pivot_longer`
PR Description
Please describe the changes proposed in the pull request:
- improve performance for `pivot_longer`; some cases can be 3x faster
- use polars methods as much as possible
- use an implode/explode approach: work on a small set of data and blow up only at the end (good perf benefits)
- for lazyframes, avoid `.collect` where possible; use another option instead and stay lazy for as long as possible
This PR relates to #1352 .
Performance (YMMV):
import polars as pl
import janitor.polars
evv = pl.read_csv('../evv.csv')
evv.shape
(30000, 801)
# dev
%timeit evv.janitor.pivot_longer(index='country', names_to = ['event','year','gender','num'], names_transform=pl.col('year').cast(int),names_sep='_')
1.5 s ± 6.42 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit evv.lazy().janitor.pivot_longer(index='country', names_to = ['event','year','gender','num'], names_transform=pl.col('year').cast(int),names_sep="_")
3 s ± 16.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit evv.lazy().janitor.pivot_longer(index='country', names_to = ['event','year','gender','num'], names_transform=pl.col('year').cast(int),names_sep="_").collect()
5.94 s ± 24 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# this PR
%timeit evv.janitor.pivot_longer(index='country', names_to = ['event','year','gender','num'], names_transform=pl.col('year').cast(int),names_sep="_")
225 ms ± 8.49 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit evv.lazy().janitor.pivot_longer(index='country', names_to = ['event','year','gender','num'], names_transform=pl.col('year').cast(int),names_sep="_")
1.58 ms ± 4.36 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%timeit evv.lazy().janitor.pivot_longer(index='country', names_to = ['event','year','gender','num'], names_transform=pl.col('year').cast(int),names_sep="_").collect()
263 ms ± 8.73 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
🚀 Deployed on https://deploy-preview-1377--pyjanitor.netlify.app
Codecov Report
Attention: Patch coverage is 95.74468% with 4 lines in your changes missing coverage. Please review.
Project coverage is 88.96%. Comparing base (62c57c6) to head (6a5f66e). Report is 27 commits behind head on dev.
Current head 6a5f66e differs from pull request most recent head 1fc553e
Please upload reports for the commit 1fc553e to get more accurate results.
Additional details and impacted files
@@            Coverage Diff             @@
##              dev    #1377      +/-   ##
==========================================
- Coverage   94.48%   88.96%   -5.52%
==========================================
  Files          80       86       +6
  Lines        4367     5058     +691
==========================================
+ Hits         4126     4500     +374
- Misses        241      558     +317
@ericmjl Ok to do a release?