[ENH] improve performance for polars' `pivot_longer`

Open samukweku opened this issue 1 year ago • 2 comments

PR Description

Please describe the changes proposed in the pull request:

  • improve performance for `pivot_longer`; some cases are about 3x faster
  • use native polars methods as much as possible
  • use an implode/explode approach: work on a small set of data and blow it up only at the end (good perf benefits)
  • for LazyFrames, avoid `.collect` where possible and stay lazy for as long as possible

This PR relates to #1352 .

Performance (YMMV):

import polars as pl
import janitor.polars

evv = pl.read_csv('../evv.csv')
evv.shape
(30000, 801)
# dev
%timeit evv.janitor.pivot_longer(index='country', names_to = ['event','year','gender','num'], names_transform=pl.col('year').cast(int),names_sep='_')
1.5 s ± 6.42 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit evv.lazy().janitor.pivot_longer(index='country', names_to = ['event','year','gender','num'], names_transform=pl.col('year').cast(int),names_sep="_")
3 s ± 16.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit evv.lazy().janitor.pivot_longer(index='country', names_to = ['event','year','gender','num'], names_transform=pl.col('year').cast(int),names_sep="_").collect()
5.94 s ± 24 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# this PR
%timeit evv.janitor.pivot_longer(index='country', names_to = ['event','year','gender','num'], names_transform=pl.col('year').cast(int),names_sep="_")
225 ms ± 8.49 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit evv.lazy().janitor.pivot_longer(index='country', names_to = ['event','year','gender','num'], names_transform=pl.col('year').cast(int),names_sep="_")
1.58 ms ± 4.36 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

%timeit evv.lazy().janitor.pivot_longer(index='country', names_to = ['event','year','gender','num'], names_transform=pl.col('year').cast(int),names_sep="_").collect()
263 ms ± 8.73 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
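
The speedups implied by the timings above can be worked out directly. This is only arithmetic on the reported means, copied by hand from the `%timeit` output; the dictionary keys are labels chosen here for illustration. The eager and lazy-plus-collect cases come out at roughly 6.7x and 22.6x. The plan-only lazy figure is not a like-for-like comparison, since on this PR that call only builds the query plan.

```python
# Mean timings in seconds, copied from the %timeit output: (dev, this PR).
timings = {
    "eager": (1.5, 0.225),
    "lazy, plan only": (3.0, 1.58e-3),
    "lazy + collect": (5.94, 0.263),
}

# Speedup = time on dev / time on this PR.
speedups = {case: dev / pr for case, (dev, pr) in timings.items()}

for case, factor in speedups.items():
    print(f"{case}: {factor:.1f}x")
```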

samukweku, Jun 18 '24 03:06

🚀 Deployed on https://deploy-preview-1377--pyjanitor.netlify.app

ericmjl, Jun 18 '24 03:06

Codecov Report

Attention: Patch coverage is 95.74468% with 4 lines in your changes missing coverage. Please review.

Project coverage is 88.96%. Comparing base (62c57c6) to head (6a5f66e). Report is 27 commits behind head on dev.

❗ Current head 6a5f66e differs from pull request most recent head 1fc553e

Please upload reports for the commit 1fc553e to get more accurate results.

Additional details and impacted files
@@            Coverage Diff             @@
##              dev    #1377      +/-   ##
==========================================
- Coverage   94.48%   88.96%   -5.52%     
==========================================
  Files          80       86       +6     
  Lines        4367     5058     +691     
==========================================
+ Hits         4126     4500     +374     
- Misses        241      558     +317     

codecov[bot], Jun 20 '24 23:06

@ericmjl Ok to do a release?

samukweku, Jul 04 '24 21:07