mice icon indicating copy to clipboard operation
mice copied to clipboard

On futuremice() and reproducibility

Open edbonneville opened this issue 1 year ago • 3 comments

Dear {mice} team,

Thank you for your continued work on this package!

I had a couple of suggestions/questions regarding the futuremice() function. The first is on how the plan() is currently set within the function, which not recommended as per one of the {future} vignettes. If someone is using futuremice() because they want to make use of parallel processing anyway, would it not be more flexible to let them set their own plan()? Multiple arguments (e.g. future.plan, use.logical, n.core, parallelseed) would then no longer be necessary, and would allow advanced users to specify nested futures (e.g. when you want to parallelize imputations within parallelized simulation replications).

The second (related) point is that I think one of the major advantages of using this future framework is that your results can be reproducible regardless of how another user decides to parallelize the same analysis (e.g. local computer vs. computing cluster, differing numbers of cores etc). In the current implementation, futuremice() will not be able to reproduce the same set of imputations when only the number of cores differs (same dataset, number of imputations, and seed). For example:

# Globals
library(mice)
library(future)
dat <- mice::nhanes2
m <- 5

# Same seed, same m, but different n.core -> different results
set.seed(123)
with(futuremice(dat, m = m, n.core = 2), lm(chl ~ bmi)) |> pool()
#> Class: mipo    m = 5 
#>          term m   estimate        ubar           b           t dfcom        df
#> 1 (Intercept) 5 104.141215 3315.440261 1541.724538 5165.509707    23  9.482795
#> 2         bmi 5   3.407003    4.754373    1.771861    6.880607    23 10.864951
#>         riv    lambda       fmi
#> 1 0.5580162 0.3581582 0.4609944
#> 2 0.4472164 0.3090183 0.4086915

set.seed(123)
with(futuremice(dat, m = m, n.core = 3), lm(chl ~ bmi)) |> pool()
#> Class: mipo    m = 5 
#>          term m   estimate        ubar           b           t dfcom       df
#> 1 (Intercept) 5 109.612876 3135.720812 1783.198371 5275.558857    23 8.307415
#> 2         bmi 5   3.088449    4.326505    2.182976    6.946076    23 8.994692
#>         riv    lambda       fmi
#> 1 0.6824071 0.4056135 0.5107457
#> 2 0.6054705 0.3771296 0.4809873

This is a side effect of n.core dictating how the m imputations are spread out across the cores. A way to make the imputations reproducible regardless of n.core might look something like this:

# Tweaked futuremice()
library(future.apply) # since fewer dependencies than {furrr}
futuremice2 <- function(data, m, ...) {
  mids_list <- future_lapply(
    X = seq_len(m),
    FUN = function(x) mice::mice(data = data, m = 1, ...),
    future.seed = TRUE
  )
  imp <- Reduce(f = mice::ibind, x = mids_list)
  return(imp)
}

# Same seed, same m, but different n.core -> same results
set.seed(123)
plan(multisession, workers = 2)
with(futuremice2(dat, m = m, print = FALSE), lm(chl ~ bmi)) |> pool()
#> Class: mipo    m = 5 
#>          term m   estimate       ubar           b           t dfcom       df
#> 1 (Intercept) 5 107.820027 2789.44692 673.0012948 3597.048470    23 13.63506
#> 2         bmi 5   3.283595    3.87059   0.7406842    4.759411    23 15.00680
#>         riv    lambda       fmi
#> 1 0.2895203 0.2245178 0.3177525
#> 2 0.2296345 0.1867502 0.2770772

set.seed(123)
plan(multisession, workers = 3)
with(futuremice2(dat, m = m, print = FALSE), lm(chl ~ bmi)) |> pool()
#> Class: mipo    m = 5 
#>          term m   estimate       ubar           b           t dfcom       df
#> 1 (Intercept) 5 107.820027 2789.44692 673.0012948 3597.048470    23 13.63506
#> 2         bmi 5   3.283595    3.87059   0.7406842    4.759411    23 15.00680
#>         riv    lambda       fmi
#> 1 0.2895203 0.2245178 0.3177525
#> 2 0.2296345 0.1867502 0.2770772

# Reset back to sequential
plan(sequential)

Of course if you do some benchmarks, futuremice2() will be slower than futuremice(), since the former calls mice() a whole m times while the latter only n.core times. Nevertheless, a) futuremice2() will still outperform sequential mice() (depending on size of dataset and number of imputations); b) you could speed it up further by for example calling mice() a first time to perform the necessary checks, and then only call a function which would boil down to running mice() from here, which uses the defined predictorMatrix/maxit/method/etc from the first call.

So my main point: the current implementation does optimize computational resources nicely, but not reproducibility - so something like futuremice2() could be a compromise. What do you think? I guess two arguments for keeping the current implementation is that a) futuremice() should focus on optimizing speed; b) we should anyway be using a large enough number of imputed datasets such that our results do not differ much between independent imputation rounds.. :sweat_smile:

Thanks!

Session info

devtools::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.2.1 (2022-06-23 ucrt)
#>  os       Windows 10 x64 (build 19044)
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language (EN)
#>  collate  Dutch_Netherlands.utf8
#>  ctype    Dutch_Netherlands.utf8
#>  tz       Europe/Berlin
#>  date     2023-05-19
#>  pandoc   2.19.2 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package      * version date (UTC) lib source
#>  assertthat     0.2.1   2019-03-21 [1] CRAN (R 4.2.1)
#>  backports      1.4.1   2021-12-13 [1] CRAN (R 4.2.0)
#>  broom          1.0.1   2022-08-29 [1] CRAN (R 4.2.1)
#>  cachem         1.0.6   2021-08-19 [1] CRAN (R 4.2.1)
#>  callr          3.7.3   2022-11-02 [1] CRAN (R 4.2.2)
#>  cli            3.4.1   2022-09-23 [1] CRAN (R 4.2.2)
#>  codetools      0.2-18  2020-11-04 [2] CRAN (R 4.2.1)
#>  crayon         1.5.2   2022-09-29 [1] CRAN (R 4.2.2)
#>  DBI            1.1.3   2022-06-18 [1] CRAN (R 4.2.1)
#>  devtools       2.4.5   2022-10-11 [1] CRAN (R 4.2.2)
#>  digest         0.6.30  2022-10-18 [1] CRAN (R 4.2.2)
#>  dplyr          1.0.10  2022-09-01 [1] CRAN (R 4.2.1)
#>  ellipsis       0.3.2   2021-04-29 [1] CRAN (R 4.2.1)
#>  evaluate       0.18    2022-11-07 [1] CRAN (R 4.2.2)
#>  fansi          1.0.3   2022-03-24 [1] CRAN (R 4.2.1)
#>  fastmap        1.1.0   2021-01-25 [1] CRAN (R 4.2.1)
#>  fs             1.5.2   2021-12-08 [1] CRAN (R 4.2.1)
#>  furrr          0.3.1   2022-08-15 [1] CRAN (R 4.2.1)
#>  future       * 1.29.0  2022-11-06 [1] CRAN (R 4.2.2)
#>  future.apply * 1.10.0  2022-11-05 [1] CRAN (R 4.2.2)
#>  generics       0.1.3   2022-07-05 [1] CRAN (R 4.2.1)
#>  globals        0.16.1  2022-08-28 [1] CRAN (R 4.2.1)
#>  glue           1.6.2   2022-02-24 [1] CRAN (R 4.2.1)
#>  highr          0.9     2021-04-16 [1] CRAN (R 4.2.1)
#>  htmltools      0.5.3   2022-07-18 [1] CRAN (R 4.2.1)
#>  htmlwidgets    1.5.4   2021-09-08 [1] CRAN (R 4.2.1)
#>  httpuv         1.6.6   2022-09-08 [1] CRAN (R 4.2.2)
#>  knitr          1.40    2022-08-24 [1] CRAN (R 4.2.1)
#>  later          1.3.0   2021-08-18 [1] CRAN (R 4.2.1)
#>  lattice        0.20-45 2021-09-22 [2] CRAN (R 4.2.1)
#>  lifecycle      1.0.3   2022-10-07 [1] CRAN (R 4.2.2)
#>  listenv        0.8.0   2019-12-05 [1] CRAN (R 4.2.1)
#>  magrittr       2.0.3   2022-03-30 [1] CRAN (R 4.2.1)
#>  memoise        2.0.1   2021-11-26 [1] CRAN (R 4.2.1)
#>  mice         * 3.15.0  2022-11-19 [1] CRAN (R 4.2.3)
#>  mime           0.12    2021-09-28 [1] CRAN (R 4.2.0)
#>  miniUI         0.1.1.1 2018-05-18 [1] CRAN (R 4.2.1)
#>  parallelly     1.32.1  2022-07-21 [1] CRAN (R 4.2.1)
#>  pillar         1.8.1   2022-08-19 [1] CRAN (R 4.2.1)
#>  pkgbuild       1.3.1   2021-12-20 [1] CRAN (R 4.2.1)
#>  pkgconfig      2.0.3   2019-09-22 [1] CRAN (R 4.2.1)
#>  pkgload        1.3.2   2022-11-16 [1] CRAN (R 4.2.1)
#>  prettyunits    1.1.1   2020-01-24 [1] CRAN (R 4.2.1)
#>  processx       3.8.0   2022-10-26 [1] CRAN (R 4.2.2)
#>  profvis        0.3.7   2020-11-02 [1] CRAN (R 4.2.1)
#>  promises       1.2.0.1 2021-02-11 [1] CRAN (R 4.2.1)
#>  ps             1.7.2   2022-10-26 [1] CRAN (R 4.2.2)
#>  purrr          0.3.5   2022-10-06 [1] CRAN (R 4.2.2)
#>  R.cache        0.16.0  2022-07-21 [1] CRAN (R 4.2.1)
#>  R.methodsS3    1.8.2   2022-06-13 [1] CRAN (R 4.2.0)
#>  R.oo           1.25.0  2022-06-12 [1] CRAN (R 4.2.0)
#>  R.utils        2.12.2  2022-11-11 [1] CRAN (R 4.2.1)
#>  R6             2.5.1   2021-08-19 [1] CRAN (R 4.2.1)
#>  Rcpp           1.0.9   2022-07-08 [1] CRAN (R 4.2.1)
#>  remotes        2.4.2   2021-11-30 [1] CRAN (R 4.2.1)
#>  reprex         2.0.2   2022-08-17 [1] CRAN (R 4.2.1)
#>  rlang          1.0.6   2022-09-24 [1] CRAN (R 4.2.2)
#>  rmarkdown      2.18    2022-11-09 [1] CRAN (R 4.2.2)
#>  rstudioapi     0.14    2022-08-22 [1] CRAN (R 4.2.1)
#>  sessioninfo    1.2.2   2021-12-06 [1] CRAN (R 4.2.1)
#>  shiny          1.7.3   2022-10-25 [1] CRAN (R 4.2.2)
#>  stringi        1.7.8   2022-07-11 [1] CRAN (R 4.2.1)
#>  stringr        1.5.0   2022-12-02 [1] CRAN (R 4.2.3)
#>  styler         1.8.1   2022-11-07 [1] CRAN (R 4.2.2)
#>  tibble         3.1.8   2022-07-22 [1] CRAN (R 4.2.1)
#>  tidyr          1.2.1   2022-09-08 [1] CRAN (R 4.2.2)
#>  tidyselect     1.2.0   2022-10-10 [1] CRAN (R 4.2.2)
#>  urlchecker     1.0.1   2021-11-30 [1] CRAN (R 4.2.1)
#>  usethis        2.1.6   2022-05-25 [1] CRAN (R 4.2.1)
#>  utf8           1.2.2   2021-07-24 [1] CRAN (R 4.2.1)
#>  vctrs          0.5.1   2022-11-16 [1] CRAN (R 4.2.1)
#>  withr          2.5.0   2022-03-03 [1] CRAN (R 4.2.1)
#>  xfun           0.35    2022-11-16 [1] CRAN (R 4.2.1)
#>  xtable         1.8-4   2019-04-21 [1] CRAN (R 4.2.1)
#>  yaml           2.3.6   2022-10-18 [1] CRAN (R 4.2.1)
#> 
#>  [1] C:/Users/efbonneville/software/packagesR
#>  [2] C:/Program Files/R/R-4.2.1/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────

edbonneville avatar May 19 '23 13:05 edbonneville