padr icon indicating copy to clipboard operation
padr copied to clipboard

Performance improvement for pad()

Open joranE opened this issue 2 years ago • 1 comments

I was profiling some code that calls padr::pad() and found that span_all_groups() was using a significant chunk of the total time. This proposed change converts span_all_groups() to use purrr::map2 and tidyr::unnest() to accomplish the same thing. In my tests it is ~2x faster than the original span_all_groups() on my M1 Max Apple laptop and ~3x faster on the virtual Linux machine I use for work. It's a small absolute improvement on my Mac (<0.5sec) but a more noticeable improvement on my slower work Linux machine (~1.5sec), at least in my context, where I'm doing some data processing while loading a Shiny app.

A tradeoff is the additional imports of purrr::map2 and tidyr::unnest. I noticed that padr seems to stick to base R as much as possible, so I'll understand if you're reluctant to add the additional dependencies.

Unrelated, but I think the existing stop_int64() call isn't actually doing anything, since it's being called on the list output from split() and so it's looking at the classes of the individual list elements rather than the columns in the original data frame.

All existing tests in the package are passing with this change.

joranE avatar Sep 30 '22 19:09 joranE

Thanks a lot for taking the time to seek improvements in padr. For the moment I am pressed for time. Once I have the resources to give padr some love again I will study your proposal.

EdwinTh avatar Oct 03 '22 07:10 EdwinTh