poorman
poorman copied to clipboard
Improve speed of `group_by()`
Related to #119.
Benchmark setup:
d <- data.frame(
g1 = sample(LETTERS, 8000, TRUE),
g2 = sample(LETTERS, 8000, TRUE),
g3 = sample(LETTERS, 8000, TRUE),
x1 = runif(8000),
x2 = runif(8000),
x3 = runif(8000)
)
# return a list of results so that both functions return the same output (without
# all the class problem, tibble vs data.frame)
poor = function() {
foo <- d %>%
group_by(g1, g2, g3) |>
summarise(x1 = mean(x1), x2 = max(x2), x3 = min(x3))
list(foo$x1, foo$x2, foo$x3)
}
dpl = function() {
foo <- d %>%
dplyr::group_by(g1, g2, g3) |>
dplyr::summarise(x1 = mean(x1), x2 = max(x2), x3 = min(x3))
list(foo$x1, foo$x2, foo$x3)
}
bench::mark(
poor(),
dpl()
)
Before:
#> # A tibble: 2 × 13
#> expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
#> 1 poor() 11.6s 11.6s 0.0863 1.83GB 11.0 1 127 11.6s <list [3]> <Rprofmem> <bench_tm [1]> <tibble>
#> 2 dpl() 157.2ms 178.3ms 5.77 7.34MB 5.77 3 3 520.2ms <list [3]> <Rprofmem> <bench_tm [3]> <tibble>
After:
#> # A tibble: 2 × 13
#> expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
#> 1 poor() 7.77s 7.77s 0.129 1.83GB 9.78 1 76 7.77s <list [3]> <Rprofmem> <bench_tm [1]> <tibble>
#> 2 dpl() 112.42ms 152.17ms 6.66 3.39MB 3.33 4 2 601.02ms <list [3]> <Rprofmem> <bench_tm [4]> <tibble>
Codecov Report
Base: 93.59% // Head: 93.60% // Increases project coverage by +0.01%
:tada:
Coverage data is based on head (
e6e28e8
) compared to base (3cc0a99
). Patch coverage: 100.00% of modified lines in pull request are covered.
Additional details and impacted files
@@ Coverage Diff @@
## master #121 +/- ##
==========================================
+ Coverage 93.59% 93.60% +0.01%
==========================================
Files 59 59
Lines 1483 1486 +3
==========================================
+ Hits 1388 1391 +3
Misses 95 95
Impacted Files | Coverage Δ | |
---|---|---|
R/group_utils.R | 95.12% <100.00%> (+0.38%) |
:arrow_up: |
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.
:umbrella: View full report at Codecov.
:loudspeaker: Do you have feedback about the report comment? Let us know in this issue.
TODO: the following code doesn't produce the right output (note that the output is also wrong without this PR):
suppressPackageStartupMessages(library(poorman))
d <- data.frame(
orig = rep(c("France", "UK"), each = 4),
dest = rep(c("Spain", "Germany"), times = 4),
year = rep(rep(c(2010, 2011), each = 2), 2),
value = 1:8
)
d[2, 1] <- NA
d[7, 2] <- NA
d
#> orig dest year value
#> 1 France Spain 2010 1
#> 2 <NA> Germany 2010 2
#> 3 France Spain 2011 3
#> 4 France Germany 2011 4
#> 5 UK Spain 2010 5
#> 6 UK Germany 2010 6
#> 7 UK <NA> 2011 7
#> 8 UK Germany 2011 8
d %>%
group_by(orig, dest) %>%
group_data()
#> orig dest .rows
#> 1 France Germany 4
#> 2 France Spain 1, 3
#> 3 UK Germany 6, 8
#> 4 UK Spain 5
#> 5 UK <NA>
#> 6 <NA> Germany 2, 7
The groups UK-NA and NA-Germany should have 1 row each.
I've been away. I had every intention of reviewing this this evening but I spent it catching up on other things. I'll try and look tomorrow after work.