parameters allow parallel computation during bootstrapping

Open IndrajeetPatil opened this issue 3 years ago • 11 comments

This requires adding a new parallel argument to model_parameters and then passing the value to boot calls:

For example, here we can add parallel = parallel inside the call:

https://github.com/easystats/parameters/blob/d7fed242cc655d88c0417f17f48d15c657b67b2b/R/bootstrap_model.R#L85

We can also default to parallel = "multicore", so multiple cores - if available - are used by default.

Wouldn't it be better to pass that through the ellipsis to avoid cluttering the API? Or to retrieve it from the options (as Stan does)?
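A minimal sketch of the ellipsis idea, not the actual parameters implementation: `bootstrap_coefs()` is a hypothetical helper that refits an `lm` on each resample and forwards any extra arguments (such as `parallel` or `ncpus`) straight to `boot::boot()`.

```r
library(boot)

# Hypothetical helper (an assumption for illustration, not part of parameters):
# everything in `...` is handed to boot::boot() unchanged.
bootstrap_coefs <- function(formula, data, iterations = 1000, ...) {
  statistic <- function(d, indices) {
    # Refit the model on the resampled rows and return its coefficients
    stats::coef(stats::lm(formula, data = d[indices, , drop = FALSE]))
  }
  boot::boot(data = data, statistic = statistic, R = iterations, ...)
}

set.seed(123)
res <- bootstrap_coefs(wt ~ mpg, data = mtcars, iterations = 200, parallel = "no")
dim(res$t) # one row per replicate, one column per coefficient
```

The options route needs no extra code on the calling side: `boot()` already takes its default for `parallel` from `getOption("boot.parallel", "no")` and for `ncpus` from `getOption("boot.ncpus", 1L)`, so users can set those options once per session.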

Mar 09 '21 01:03 DominiqueMakowski

Can't create a reprex because parallel doesn't seem to work with it. But passing the dots works (PR: #439).

> set.seed(123)
> library(parameters)
> 
> mod <- lm(formula = wt ~ mpg, data = mtcars)
> 
> set.seed(123)
> system.time(model_parameters(mod, bootstrap = TRUE, iterations = 1000, parallel = "no")) 
   user  system elapsed 
  1.043   0.007   1.057 
> 
> set.seed(123)
> system.time(
+   model_parameters(
+     mod,
+     bootstrap = TRUE,
+     iterations = 1000,
+     parallel = "multicore",
+     ncpus = 4L
+   )
+ ) 
   user  system elapsed 
  0.078   0.056   0.613

Mar 09 '21 17:03 IndrajeetPatil

"multicore" doesn't work on windows.
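One way around this, sketched under the assumption that the backend is chosen by the caller: `pick_boot_backend()` is a hypothetical helper (not part of boot or parameters) that returns `"snow"` on Windows and `"multicore"` elsewhere, and falls back to `"no"` when only one CPU is requested.

```r
# Hypothetical helper (an assumption for illustration): pick a value for
# boot()'s `parallel` argument that is available on the current platform.
pick_boot_backend <- function(ncpus = 1L) {
  if (ncpus <= 1L) {
    return("no") # nothing to parallelise
  }
  if (.Platform$OS.type == "windows") "snow" else "multicore"
}

pick_boot_backend(1L)
pick_boot_backend(4L)
```

The result could then be passed as `parallel = pick_boot_backend(ncpus)` together with `ncpus`.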

Mar 09 '21 18:03 strengejacke

Using regular R or Microsoft R Open doesn't seem to make a difference; increasing the number of CPUs even slows things down:

library(parameters)
#> Warning: package 'parameters' was built under R version 4.0.4
model <- lm(mpg ~ wt + cyl, data = mtcars)

microbenchmark::microbenchmark(
  model_parameters(model, bootstrap = TRUE, iterations = 1000, parallel = "snow", ncpus = 4),
  times = 5
)
#> Unit: seconds
#>                                                                                             expr
#>  model_parameters(model, bootstrap = TRUE, iterations = 1000,      parallel = "snow", ncpus = 4)
#>       min       lq    mean   median       uq      max neval
#>  2.146296 2.178574 2.18241 2.179772 2.200774 2.206634     5

microbenchmark::microbenchmark(
  model_parameters(model, bootstrap = TRUE, iterations = 1000, parallel = "no", ncpus = 4),
  times = 5
)
#> Unit: seconds
#>                                                                                           expr
#>  model_parameters(model, bootstrap = TRUE, iterations = 1000,      parallel = "no", ncpus = 4)
#>       min      lq     mean   median       uq      max neval
#>  1.120941 1.12849 1.132289 1.128846 1.137772 1.145394     5

microbenchmark::microbenchmark(
  model_parameters(model, bootstrap = TRUE, iterations = 1000, parallel = "multicore", ncpus = 4),
  times = 5
)
#> Unit: seconds
#>                                                                                                  expr
#>  model_parameters(model, bootstrap = TRUE, iterations = 1000,      parallel = "multicore", ncpus = 4)
#>       min      lq     mean   median      uq      max neval
#>  1.102907 1.10788 1.117547 1.114816 1.12571 1.136424     5

Created on 2021-03-09 by the reprex package (v1.0.0)

Mar 09 '21 18:03 strengejacke

Yeah, I'm seeing the same thing on my Mac: the computation time actually increases if I use parallel computing with ncpus set to some value > 1.

It's all a bit confusing. And this has nothing to do with parameters functions.

Here is an example from the boot package docs:

library(boot)
library(microbenchmark)

# usual bootstrap of the ratio of means using the city data
ratio <- function(d, w) sum(d$x * w) / sum(d$u * w)

set.seed(123)
microbenchmark::microbenchmark(
  boot(city, ratio, R = 4999, stype = "w"),
  times = 5
)
#> Unit: milliseconds
#>                                      expr      min       lq     mean   median
#>  boot(city, ratio, R = 4999, stype = "w") 30.76705 36.27656 39.59618 40.73334
#>        uq      max neval
#>  42.90163 47.30233     5

options(boot.parallel = "multicore")
set.seed(123)
microbenchmark::microbenchmark(
  boot(city, ratio, R = 4999, stype = "w", ncpus = 5),
  times = 5
)
#> Unit: milliseconds
#>                                                 expr      min       lq    mean
#>  boot(city, ratio, R = 4999, stype = "w", ncpus = 5) 44.64621 47.21875 51.9313
#>    median       uq     max neval
#>  48.56907 50.58117 68.6413     5

Created on 2021-03-10 by the reprex package (v1.0.0)

I think we should stay away from making any changes to parameters until we figure out how to successfully use boot's parallel computation functionality.

Mar 10 '21 08:03 IndrajeetPatil

Yes, sounds good.

Mar 10 '21 09:03 strengejacke

@bwiernik Do you have any ideas about how to get this to work?

Jul 04 '21 08:07 IndrajeetPatil

Yeah, I can take a look

Jul 04 '21 14:07 bwiernik

future is probably a better platform for cross-platform parallel computation: https://cran.r-project.org/web/packages/future/index.html

The examples in this thread are probably all too small (OLS with N=32), so the parallel overhead is heavier than the gains.

Perhaps one strategy would be for us to support extracting results from boot and other bootstrap objects. That way, users who want fancy features like parallel computation can use the existing support in the appropriate package, and we can extract and display the estimates.
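A rough sketch of what such an extractor might look like, assuming simple percentile intervals: `summarise_boot()` is a hypothetical name, and the column names are placeholders rather than the format parameters actually uses. It relies only on the `t0` (original estimates) and `t` (replicates) components of a `boot` object.

```r
library(boot)

# Hypothetical extractor (an assumption for illustration): summarise a "boot"
# object with the original estimate, bootstrap SE, and a percentile interval.
summarise_boot <- function(x, ci = 0.95) {
  stopifnot(inherits(x, "boot"))
  probs <- c((1 - ci) / 2, 1 - (1 - ci) / 2)
  # One column of limits per statistic; names dropped to keep row names clean
  limits <- unname(apply(x$t, 2, stats::quantile, probs = probs, na.rm = TRUE))
  labels <- names(x$t0)
  if (is.null(labels)) labels <- paste0("t", seq_along(x$t0))
  data.frame(
    Parameter = labels,
    Estimate = unname(x$t0),
    SE = apply(x$t, 2, stats::sd, na.rm = TRUE),
    CI_low = limits[1, ],
    CI_high = limits[2, ]
  )
}

# The user runs boot() themselves, with whatever parallel settings they like
ratio <- function(d, w) sum(d$x * w) / sum(d$u * w)
set.seed(123)
b <- boot(city, ratio, R = 999, stype = "w")
out <- summarise_boot(b)
out
```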

Jun 25 '22 10:06 vincentarelbundock

One of the major benefits of parameters is that we provide a simple interface for bootstrapping, which is otherwise really difficult for new users (learning to use the boot package is a nightmare). I agree that we should use future for parallelization, but I do think we should support it.

Jun 25 '22 12:06 bwiernik

You're right. boot is kind of a nightmare to learn.

Jun 25 '22 12:06 vincentarelbundock
