efficientR 1.6.2. - microbenchmark's multcomp reference missing

The code snippet in 1.6.2 for microbenchmark shows cld as a column in the output. cld only shows up as a column if multcomp is already installed (see microbenchmark doc).

Earlier in the chapter, it is suggested to install microbenchmark, but not this companion package.

Snippet from 1.6.2:

library("microbenchmark")
df = data.frame(v = 1:4, name = letters[1:4])
microbenchmark(df[3, 2], df[3, "name"], df$name[3])
# Unit: microseconds
#          expr     min    lq  mean median    uq   max neval cld
#      df[3, 2]   17.99 18.96 20.16  19.38 19.77 35.14   100   b
# df[3, "name"]   17.97 19.13 21.45  19.64 20.15 74.00   100   b
#    df$name[3]   12.48 13.81 15.81  14.48 15.14 67.24   100   a

May 03 '20 07:05 engineerchange

Note the code snippet in 1.6.3.1 doesn't show the microbenchmark output with the cld column.

May 03 '20 07:05 engineerchange

I suggest we switch to using the bench package. Sound reasonable @engineerchange and @csgillespie ?

May 03 '20 11:05 Robinlovelace

@Robinlovelace Switching to bench does involve a fair bit of work, as it's used throughout the book. Also, not entirely convinced that it's "better". Thoughts?

May 10 '20 17:05 csgillespie

Thoughts?

Bench tells you about memory use which is important if not vital when fixing critical performance issues. I'm a fan. No need to rush it, interested to hear what others have to say. I know @symbolixAU, for example, has not switched to bench.

I think the +s outweigh the -s, that this is an important but not priority issue. So suggest keeping it open for now.

May 10 '20 20:05 Robinlovelace

I (speaking as @SymbolixAU and myself) dislike bench because the default implementation mixes units in columns

library(bench)


foo <- function() Sys.sleep(5)
bar <- function() Sys.sleep(0.1)

bench::mark(foo(), bar())

# # A tibble: 2 x 13
# expression          min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory           time     gc              
# <bch:expr>     <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>           <list>   <list>          
#   1 foo()            5s       5s     0.200        0B        0     1     0         5s <NULL> <df[,3] [0 × 3]> <bch:tm> <tibble [1 × 3]>
#   2 bar()         100ms    100ms     9.92         0B        0     5     0      504ms <NULL> <df[,3] [0 × 3]> <bch:tm> <tibble [5 × 3]>
#

Another example is here

May 10 '20 21:05 dcooley

I'm but a practitioner, but I agree with @dcooley on the above. Mixing units by default is an odd behaviour, and given that my issue got closed immediately, it doesn't appear like the maintainer(s) will change this behavior in the near future.

May 11 '20 06:05 engineerchange

Perhaps the compromise is to suggest bench as a benchmarking alternative in Chapter 1 when microbenchmark is introduced; I do think memory use is an important feature that shouldn't be overlooked.

May 11 '20 06:05 engineerchange

That would be quicker also than changing every instance of benchmarking. It is just a question of setting an argument so if we did switch to the bench in the longer term we could set the time_unit every time, e.g. with:

library(bench)


foo <- function() Sys.sleep(5)
bar <- function() Sys.sleep(0.1)

bench::mark(foo(), bar(), time_unit = "s")
#> # A tibble: 2 x 6
#>   expression   min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <dbl>  <dbl>     <dbl> <bch:byt>    <dbl>
#> 1 foo()      5.01   5.01      0.199     2.1KB        0
#> 2 bar()      0.100  0.100     9.97         0B        0

^{Created on 2020-05-11 by the reprex package (v0.3.0)}

It's great to be discussing this, has already made me learn something new, I will use this argument in my use of bench going forward.

May 11 '20 08:05 Robinlovelace

This doesn't seems to be mentioned so far, but memory tracking in bench is not reliable. It only tracks memory allocated with R's functions (including R's C functions of course). Code is R packages don't have to use those, in fact if you need to allocate within parallel region in your C code you must not use those. That means it is not possible to measure memory with Rprof reliably, which bench use. AFAIR it uses now something more(/else?), not sure how reliable it is. When I had to measure memory usage I used cgmemtime which seems pretty accurate, but is not user friendly. As for mixing time units by default, it can lead to really not easy to spot cases:

  expression                  min
  <bch:expr>             <bch:tm>       
1 CJ.dt(DT1, DT2, DT3)      1.52m
2 CJ.dt_1(DT1, DT2, DT3)    1.53s
3 CJ.dt_2(DT1, DT2, DT3)    1.27s
4 CJDT(DT1, DT2, DT3)       1.07s

another pluses for microbenchmark are that it is very lightweight and its API and codebase is mature and stable.

Jun 13 '20 21:06 jangorecki

Thanks @jangorecki those are good reasons for sticking with the status quo!

Jun 14 '20 08:06 Robinlovelace

On a related note, I think with dplyr 1.0.0 and recent improvements in data.table it's time to refactor https://csgillespie.github.io/efficientR/data-carpentry.html. Any suggestions on that are very welcome @jangorecki !

Jun 14 '20 08:06 Robinlovelace

@Robinlovelace all looks to be up to date. I would only mention how base R changing with square bracket can be used with data.table due to its local scope.

Aug 26 '20 06:08 jangorecki

Thanks @jangorecki, going back to the original issue, the solutions would seem to be keep the cld column or remove it in every benchmark. Which do you recommend?

The code snippet in 1.6.2 for microbenchmark shows cld as a column in the output.

Aug 26 '20 08:08 Robinlovelace

I would remove it, to keep things more simple.

Aug 26 '20 08:08 jangorecki

efficientR efficientR copied to clipboard

1.6.2. - microbenchmark's multcomp reference missing

efficientR
efficientR copied to clipboard