poorman icon indicating copy to clipboard operation
poorman copied to clipboard

Improve speed of `group_by()`

Open etiennebacher opened this issue 2 years ago • 3 comments

Related to #119.

Benchmark setup:

d <- data.frame(
  g1 = sample(LETTERS, 8000, TRUE),
  g2 = sample(LETTERS, 8000, TRUE),
  g3 = sample(LETTERS, 8000, TRUE),
  x1 = runif(8000),
  x2 = runif(8000),
  x3 = runif(8000)
)

# return a list of results so that both functions return the same output (without
# all the class problem, tibble vs data.frame)
poor = function() {
  foo <- d %>% 
    group_by(g1, g2, g3) |> 
    summarise(x1 = mean(x1), x2 = max(x2), x3 = min(x3))

  list(foo$x1, foo$x2, foo$x3)
}

dpl = function() {
  foo <- d %>% 
    dplyr::group_by(g1, g2, g3) |> 
    dplyr::summarise(x1 = mean(x1), x2 = max(x2), x3 = min(x3))

  list(foo$x1, foo$x2, foo$x3)
}

bench::mark(
  poor(),
  dpl()
)

Before:

#> # A tibble: 2 × 13
#>   expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result     memory     time           gc      
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>     <list>     <list>         <list>  
#> 1 poor()        11.6s    11.6s    0.0863    1.83GB    11.0      1   127      11.6s <list [3]> <Rprofmem> <bench_tm [1]> <tibble>
#> 2 dpl()       157.2ms  178.3ms    5.77      7.34MB     5.77     3     3    520.2ms <list [3]> <Rprofmem> <bench_tm [3]> <tibble>

After:

#> # A tibble: 2 × 13
#>   expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result     memory     time           gc      
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>     <list>     <list>         <list>  
#> 1 poor()        7.77s    7.77s     0.129    1.83GB     9.78     1    76      7.77s <list [3]> <Rprofmem> <bench_tm [1]> <tibble>
#> 2 dpl()      112.42ms 152.17ms     6.66     3.39MB     3.33     4     2   601.02ms <list [3]> <Rprofmem> <bench_tm [4]> <tibble>

etiennebacher avatar Jan 06 '23 09:01 etiennebacher

Codecov Report

Base: 93.59% // Head: 93.60% // Increases project coverage by +0.01% :tada:

Coverage data is based on head (e6e28e8) compared to base (3cc0a99). Patch coverage: 100.00% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #121      +/-   ##
==========================================
+ Coverage   93.59%   93.60%   +0.01%     
==========================================
  Files          59       59              
  Lines        1483     1486       +3     
==========================================
+ Hits         1388     1391       +3     
  Misses         95       95              
Impacted Files Coverage Δ
R/group_utils.R 95.12% <100.00%> (+0.38%) :arrow_up:

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

:umbrella: View full report at Codecov.
:loudspeaker: Do you have feedback about the report comment? Let us know in this issue.

codecov-commenter avatar Jan 06 '23 12:01 codecov-commenter

TODO: the following code doesn't produce the right output (note that the output is also wrong without this PR):

suppressPackageStartupMessages(library(poorman))

d <- data.frame(
  orig = rep(c("France", "UK"), each = 4),
  dest = rep(c("Spain", "Germany"), times = 4),
  year = rep(rep(c(2010, 2011), each = 2), 2),
  value = 1:8
)
d[2, 1] <- NA
d[7, 2] <- NA

d
#>     orig    dest year value
#> 1 France   Spain 2010     1
#> 2   <NA> Germany 2010     2
#> 3 France   Spain 2011     3
#> 4 France Germany 2011     4
#> 5     UK   Spain 2010     5
#> 6     UK Germany 2010     6
#> 7     UK    <NA> 2011     7
#> 8     UK Germany 2011     8

d %>%
  group_by(orig, dest) %>%
  group_data()
#>     orig    dest .rows
#> 1 France Germany     4
#> 2 France   Spain  1, 3
#> 3     UK Germany  6, 8
#> 4     UK   Spain     5
#> 5     UK    <NA>      
#> 6   <NA> Germany  2, 7

The groups UK-NA and NA-Germany should have 1 row each.

etiennebacher avatar Jan 06 '23 12:01 etiennebacher

I've been away. I had every intention of reviewing this this evening but I spent it catching up on other things. I'll try and look tomorrow after work.

nathaneastwood avatar Jan 16 '23 21:01 nathaneastwood