mice
mice copied to clipboard
On futuremice() and reproducibility
Dear {mice}
team,
Thank you for your continued work on this package!
I had a couple of suggestions/questions regarding the futuremice()
function. The first is on how the plan()
is currently set within the function, which not recommended as per one of the {future}
vignettes. If someone is using futuremice()
because they want to make use of parallel processing anyway, would it not be more flexible to let them set their own plan()
? Multiple arguments (e.g. future.plan, use.logical, n.core, parallelseed) would then no longer be necessary, and would allow advanced users to specify nested futures (e.g. when you want to parallelize imputations within parallelized simulation replications).
The second (related) point is that I think one of the major advantages of using this future framework is that your results can be reproducible regardless of how another user decides to parallelize the same analysis (e.g. local computer vs. computing cluster, differing numbers of cores etc). In the current implementation, futuremice()
will not be able to reproduce the same set of imputations when only the number of cores differs (same dataset, number of imputations, and seed). For example:
# Globals
library(mice)
library(future)
dat <- mice::nhanes2
m <- 5
# Same seed, same m, but different n.core -> different results
set.seed(123)
with(futuremice(dat, m = m, n.core = 2), lm(chl ~ bmi)) |> pool()
#> Class: mipo m = 5
#> term m estimate ubar b t dfcom df
#> 1 (Intercept) 5 104.141215 3315.440261 1541.724538 5165.509707 23 9.482795
#> 2 bmi 5 3.407003 4.754373 1.771861 6.880607 23 10.864951
#> riv lambda fmi
#> 1 0.5580162 0.3581582 0.4609944
#> 2 0.4472164 0.3090183 0.4086915
set.seed(123)
with(futuremice(dat, m = m, n.core = 3), lm(chl ~ bmi)) |> pool()
#> Class: mipo m = 5
#> term m estimate ubar b t dfcom df
#> 1 (Intercept) 5 109.612876 3135.720812 1783.198371 5275.558857 23 8.307415
#> 2 bmi 5 3.088449 4.326505 2.182976 6.946076 23 8.994692
#> riv lambda fmi
#> 1 0.6824071 0.4056135 0.5107457
#> 2 0.6054705 0.3771296 0.4809873
This is a side effect of n.core
dictating how the m
imputations are spread out across the cores. A way to make the imputations reproducible regardless of n.core
might look something like this:
# Tweaked futuremice()
library(future.apply) # since fewer dependencies than {furrr}
futuremice2 <- function(data, m, ...) {
mids_list <- future_lapply(
X = seq_len(m),
FUN = function(x) mice::mice(data = data, m = 1, ...),
future.seed = TRUE
)
imp <- Reduce(f = mice::ibind, x = mids_list)
return(imp)
}
# Same seed, same m, but different n.core -> same results
set.seed(123)
plan(multisession, workers = 2)
with(futuremice2(dat, m = m, print = FALSE), lm(chl ~ bmi)) |> pool()
#> Class: mipo m = 5
#> term m estimate ubar b t dfcom df
#> 1 (Intercept) 5 107.820027 2789.44692 673.0012948 3597.048470 23 13.63506
#> 2 bmi 5 3.283595 3.87059 0.7406842 4.759411 23 15.00680
#> riv lambda fmi
#> 1 0.2895203 0.2245178 0.3177525
#> 2 0.2296345 0.1867502 0.2770772
set.seed(123)
plan(multisession, workers = 3)
with(futuremice2(dat, m = m, print = FALSE), lm(chl ~ bmi)) |> pool()
#> Class: mipo m = 5
#> term m estimate ubar b t dfcom df
#> 1 (Intercept) 5 107.820027 2789.44692 673.0012948 3597.048470 23 13.63506
#> 2 bmi 5 3.283595 3.87059 0.7406842 4.759411 23 15.00680
#> riv lambda fmi
#> 1 0.2895203 0.2245178 0.3177525
#> 2 0.2296345 0.1867502 0.2770772
# Reset back to sequential
plan(sequential)
Of course if you do some benchmarks, futuremice2()
will be slower than futuremice()
, since the former calls mice()
a whole m
times while the latter only n.core
times. Nevertheless, a) futuremice2()
will still outperform sequential mice()
(depending on size of dataset and number of imputations); b) you could speed it up further by for example calling mice()
a first time to perform the necessary checks, and then only call a function which would boil down to running mice()
from here, which uses the defined predictorMatrix/maxit/method/etc from the first call.
So my main point: the current implementation does optimize computational resources nicely, but not reproducibility - so something like futuremice2()
could be a compromise. What do you think? I guess two arguments for keeping the current implementation is that a) futuremice()
should focus on optimizing speed; b) we should anyway be using a large enough number of imputed datasets such that our results do not differ much between independent imputation rounds.. :sweat_smile:
Thanks!
Session info
devtools::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#> setting value
#> version R version 4.2.1 (2022-06-23 ucrt)
#> os Windows 10 x64 (build 19044)
#> system x86_64, mingw32
#> ui RTerm
#> language (EN)
#> collate Dutch_Netherlands.utf8
#> ctype Dutch_Netherlands.utf8
#> tz Europe/Berlin
#> date 2023-05-19
#> pandoc 2.19.2 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#> package * version date (UTC) lib source
#> assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.2.1)
#> backports 1.4.1 2021-12-13 [1] CRAN (R 4.2.0)
#> broom 1.0.1 2022-08-29 [1] CRAN (R 4.2.1)
#> cachem 1.0.6 2021-08-19 [1] CRAN (R 4.2.1)
#> callr 3.7.3 2022-11-02 [1] CRAN (R 4.2.2)
#> cli 3.4.1 2022-09-23 [1] CRAN (R 4.2.2)
#> codetools 0.2-18 2020-11-04 [2] CRAN (R 4.2.1)
#> crayon 1.5.2 2022-09-29 [1] CRAN (R 4.2.2)
#> DBI 1.1.3 2022-06-18 [1] CRAN (R 4.2.1)
#> devtools 2.4.5 2022-10-11 [1] CRAN (R 4.2.2)
#> digest 0.6.30 2022-10-18 [1] CRAN (R 4.2.2)
#> dplyr 1.0.10 2022-09-01 [1] CRAN (R 4.2.1)
#> ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.2.1)
#> evaluate 0.18 2022-11-07 [1] CRAN (R 4.2.2)
#> fansi 1.0.3 2022-03-24 [1] CRAN (R 4.2.1)
#> fastmap 1.1.0 2021-01-25 [1] CRAN (R 4.2.1)
#> fs 1.5.2 2021-12-08 [1] CRAN (R 4.2.1)
#> furrr 0.3.1 2022-08-15 [1] CRAN (R 4.2.1)
#> future * 1.29.0 2022-11-06 [1] CRAN (R 4.2.2)
#> future.apply * 1.10.0 2022-11-05 [1] CRAN (R 4.2.2)
#> generics 0.1.3 2022-07-05 [1] CRAN (R 4.2.1)
#> globals 0.16.1 2022-08-28 [1] CRAN (R 4.2.1)
#> glue 1.6.2 2022-02-24 [1] CRAN (R 4.2.1)
#> highr 0.9 2021-04-16 [1] CRAN (R 4.2.1)
#> htmltools 0.5.3 2022-07-18 [1] CRAN (R 4.2.1)
#> htmlwidgets 1.5.4 2021-09-08 [1] CRAN (R 4.2.1)
#> httpuv 1.6.6 2022-09-08 [1] CRAN (R 4.2.2)
#> knitr 1.40 2022-08-24 [1] CRAN (R 4.2.1)
#> later 1.3.0 2021-08-18 [1] CRAN (R 4.2.1)
#> lattice 0.20-45 2021-09-22 [2] CRAN (R 4.2.1)
#> lifecycle 1.0.3 2022-10-07 [1] CRAN (R 4.2.2)
#> listenv 0.8.0 2019-12-05 [1] CRAN (R 4.2.1)
#> magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.2.1)
#> memoise 2.0.1 2021-11-26 [1] CRAN (R 4.2.1)
#> mice * 3.15.0 2022-11-19 [1] CRAN (R 4.2.3)
#> mime 0.12 2021-09-28 [1] CRAN (R 4.2.0)
#> miniUI 0.1.1.1 2018-05-18 [1] CRAN (R 4.2.1)
#> parallelly 1.32.1 2022-07-21 [1] CRAN (R 4.2.1)
#> pillar 1.8.1 2022-08-19 [1] CRAN (R 4.2.1)
#> pkgbuild 1.3.1 2021-12-20 [1] CRAN (R 4.2.1)
#> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.2.1)
#> pkgload 1.3.2 2022-11-16 [1] CRAN (R 4.2.1)
#> prettyunits 1.1.1 2020-01-24 [1] CRAN (R 4.2.1)
#> processx 3.8.0 2022-10-26 [1] CRAN (R 4.2.2)
#> profvis 0.3.7 2020-11-02 [1] CRAN (R 4.2.1)
#> promises 1.2.0.1 2021-02-11 [1] CRAN (R 4.2.1)
#> ps 1.7.2 2022-10-26 [1] CRAN (R 4.2.2)
#> purrr 0.3.5 2022-10-06 [1] CRAN (R 4.2.2)
#> R.cache 0.16.0 2022-07-21 [1] CRAN (R 4.2.1)
#> R.methodsS3 1.8.2 2022-06-13 [1] CRAN (R 4.2.0)
#> R.oo 1.25.0 2022-06-12 [1] CRAN (R 4.2.0)
#> R.utils 2.12.2 2022-11-11 [1] CRAN (R 4.2.1)
#> R6 2.5.1 2021-08-19 [1] CRAN (R 4.2.1)
#> Rcpp 1.0.9 2022-07-08 [1] CRAN (R 4.2.1)
#> remotes 2.4.2 2021-11-30 [1] CRAN (R 4.2.1)
#> reprex 2.0.2 2022-08-17 [1] CRAN (R 4.2.1)
#> rlang 1.0.6 2022-09-24 [1] CRAN (R 4.2.2)
#> rmarkdown 2.18 2022-11-09 [1] CRAN (R 4.2.2)
#> rstudioapi 0.14 2022-08-22 [1] CRAN (R 4.2.1)
#> sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.2.1)
#> shiny 1.7.3 2022-10-25 [1] CRAN (R 4.2.2)
#> stringi 1.7.8 2022-07-11 [1] CRAN (R 4.2.1)
#> stringr 1.5.0 2022-12-02 [1] CRAN (R 4.2.3)
#> styler 1.8.1 2022-11-07 [1] CRAN (R 4.2.2)
#> tibble 3.1.8 2022-07-22 [1] CRAN (R 4.2.1)
#> tidyr 1.2.1 2022-09-08 [1] CRAN (R 4.2.2)
#> tidyselect 1.2.0 2022-10-10 [1] CRAN (R 4.2.2)
#> urlchecker 1.0.1 2021-11-30 [1] CRAN (R 4.2.1)
#> usethis 2.1.6 2022-05-25 [1] CRAN (R 4.2.1)
#> utf8 1.2.2 2021-07-24 [1] CRAN (R 4.2.1)
#> vctrs 0.5.1 2022-11-16 [1] CRAN (R 4.2.1)
#> withr 2.5.0 2022-03-03 [1] CRAN (R 4.2.1)
#> xfun 0.35 2022-11-16 [1] CRAN (R 4.2.1)
#> xtable 1.8-4 2019-04-21 [1] CRAN (R 4.2.1)
#> yaml 2.3.6 2022-10-18 [1] CRAN (R 4.2.1)
#>
#> [1] C:/Users/efbonneville/software/packagesR
#> [2] C:/Program Files/R/R-4.2.1/library
#>
#> ──────────────────────────────────────────────────────────────────────────────