mlr3proba
parallelization issues with coxph
Hi all,
we want to run a nested cross-validation with surv.coxph. Therefore we want to parallelize the outer loop (using future).
Unfortunately the parallelization does not seem to work; we need the same CPU time for different numbers of workers. E.g. for workers = 4 we need 4.424515 mins and for 2 workers we need 2.448493 mins. Using nested cross-validation for a classification task we could nicely see the decrease in time when increasing the workers.
Below you can find an example:
library(mlr3)
library(mlr3fselect)
library(mlr3learners)
library(dplyr)
library(mlr3proba)
task = TaskSurv$new(survival::rats %>% select(-sex), id = "right_censored",
time = "time", event = "status", type = "right")
# Learner
learner_auto <- AutoFSelector$new(
learner = lrn("surv.coxph"),
resampling = rsmp("cv",folds = 3),
measure = msr( 'surv.cindex'),
store_models = F, store_benchmark_result = F, store_fselect_instance = T,
terminator = trm("evals", n_evals = 100000),
fselector = fs("exhaustive_search", max_features = 1)
)
# outer loop
outer_rsmp <- rsmp("subsampling", repeats = 16, ratio = 0.7)
outer_rsmp$instantiate(task)
## benchmark grid
design <- benchmark_grid(
learners = learner_auto,
tasks = task,
resamplings = outer_rsmp
)
# parallelization
future::plan(list(future::tweak("multisession", **workers = 4)**, 'sequential') )
start = Sys.time()
bench_res <- benchmark(design,store_models = T, store_backends = F)
end = Sys.time()
print(end-start)
# parallelization
future::plan(list(future::tweak("multisession", **workers = 2**), 'sequential') )
start = Sys.time()
bench_res <- benchmark(design,store_models = T, store_backends = F)
end = Sys.time()
print(end-start)
Best regards, Lena
Hey, sorry, I can't get your example working. Can you please use reprex::reprex()?
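(For reference, a minimal way to do that, assuming the script has been copied to the clipboard:)
reprex::reprex()  # renders the clipboard contents, with output, as a reprex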
Hi Raphael,
sorry about that. I hope it works now, if not, please let me know!
Best, Lena
library(mlr3)
library(mlr3fselect)
library(mlr3learners)
library(dplyr)
library(mlr3proba)
task = TaskSurv$new(survival::rats %>% select(-sex), id = "right_censored",
time = "time", event = "status", type = "right")
# Learner
learner_auto <- AutoFSelector$new(
learner = lrn("surv.coxph"),
resampling = rsmp("cv",folds = 3),
measure = msr( 'surv.cindex'),
store_models = F, store_benchmark_result = F, store_fselect_instance = T,
terminator = trm("evals", n_evals = 100000),
fselector = fs("exhaustive_search", max_features = 1)
)
# outer loop
outer_rsmp <- rsmp("subsampling", repeats = 16, ratio = 0.7)
outer_rsmp$instantiate(task)
## benchmark grid
design <- benchmark_grid(
learners = learner_auto,
tasks = task,
resamplings = outer_rsmp
)
# parallelization
future::plan(list(future::tweak("multisession", workers = 4), 'sequential') )
start = Sys.time()
bench_res <- benchmark(design,store_models = T, store_backends = F)
end = Sys.time()
print(end-start)
#> Time difference of 2.697443 mins
# parallelization
future::plan(list(future::tweak("multisession", workers = 2), 'sequential') )
start = Sys.time()
bench_res <- benchmark(design,store_models = T, store_backends = F)
end = Sys.time()
print(end-start)
#> Time difference of 1.315648 mins
Created on 2022-06-01 by the reprex package (v2.0.0)
Thanks, I removed the uninformative INFO output. To me it looks strange because workers = 2 actually takes half the time of workers = 4. I haven't done any work with parallelisation in mlr3proba TBH, so maybe another member of the team who works more with future can help. Maybe @be-marc?
Doubling the time with more cores is very strange. Two ideas:
- Do you run into memory issues with 4 workers? Using the swap would slow down the calculations.
- How fast is the sequential execution?
future needs to serialize and deserialize the objects which are passed between the main process and the workers. When serialization is a major overhead, 4 workers might be slower than 2.
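A rough way to get a feel for that serialization cost on your machine (just a sketch, not from the original reply; it assumes the learner_auto and task objects from the reprex above):
# time one serialize/deserialize round trip of the objects that
# benchmark() ships to each worker, and check the payload size
system.time({
  payload <- serialize(list(learner_auto, task), connection = NULL)
  unserialize(payload)
})
length(payload)  # payload size in bytes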
I think it's 2). You can see below that without parallelization it takes 12 sec, with 2 workers 2 min, with 4 workers 3 min, and with 8 workers 6 min.
library(mlr3)
library(mlr3fselect)
library(mlr3learners)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(mlr3proba)
task = TaskSurv$new(survival::rats %>% select(-sex), id = "right_censored",
time = "time", event = "status", type = "right")
# Learner
learner_auto <- AutoFSelector$new(
learner = lrn("surv.coxph"),
resampling = rsmp("cv",folds = 3),
measure = msr( 'surv.cindex'),
store_models = F, store_benchmark_result = F, store_fselect_instance = T,
terminator = trm("evals", n_evals = 100000),
fselector = fs("exhaustive_search", max_features = 1)
)
# outer loop
outer_rsmp <- rsmp("subsampling", repeats = 16, ratio = 0.7)
outer_rsmp$instantiate(task)
## benchmark grid
design <- benchmark_grid(
learners = learner_auto,
tasks = task,
resamplings = outer_rsmp
)
lgr::get_logger("bbotk")$set_threshold("warn")
lgr::get_logger("mlr3")$set_threshold("warn")
start = Sys.time()
bench_res <- benchmark(design,store_models = T, store_backends = F)
end = Sys.time()
print(end-start)
#> Time difference of 12.27735 secs
# parallelization
future::plan(list(future::tweak("multisession", workers = 2), 'sequential') )
start = Sys.time()
bench_res <- benchmark(design,store_models = T, store_backends = F)
end = Sys.time()
print(end-start)
#> Time difference of 2.162249 mins
# parallelization
future::plan(list(future::tweak("multisession", workers = 4), 'sequential') )
start = Sys.time()
bench_res <- benchmark(design,store_models = T, store_backends = F)
end = Sys.time()
print(end-start)
#> Time difference of 3.286482 mins
# parallelization
future::plan(list(future::tweak("multisession", workers = 8), 'sequential') )
start = Sys.time()
bench_res <- benchmark(design,store_models = T, store_backends = F)
end = Sys.time()
print(end-start)
#> Time difference of 6.048725 mins
Created on 2022-06-01 by the reprex package (v2.0.0)
Thanks for testing. I will investigate this further.
I will post a few tests.
Learner serialization.
library(mlr3proba)
library(tictoc)
task = TaskSurv$new(survival::rats, id = "right_censored", time = "time", event = "status", type = "right")
task$select(setdiff(task$feature_names, "sex"))
learner = lrn("surv.coxph")
learner$train(task)
# serialize
tic()
x = serialize(learner, connection = NULL)
toc()
# 0.011 seconds
AutoFSelector with random search and full batch.
afs = AutoFSelector$new(
learner = lrn("surv.coxph"),
resampling = rsmp("cv",folds = 3),
measure = msr( 'surv.cindex'),
store_models = FALSE,
store_benchmark_result = FALSE,
store_fselect_instance = TRUE,
terminator = trm("evals", n_evals = 100),
fselector = fs("random_search", batch_size = 100)
)
## sequential
future::plan("sequential")
tic()
afs$train(task)
toc()
# 35 seconds
## multisession
future::plan("multisession", workers = 4)
tic()
afs$train(task)
toc()
# 20 seconds
Nested resampling.
resampling = rsmp("cv", folds = 3)
## sequential
future::plan("sequential")
tic()
rr = resample(task, afs, resampling)
toc()
# 112 seconds
## multisession
future::plan(list(future::tweak("multisession", workers = 2), "sequential"))
tic()
rr = resample(task, afs, resampling)
toc()
# 64 seconds
AutoFSelector with exhaustive search.
afs = AutoFSelector$new(
learner = lrn("surv.coxph"),
resampling = rsmp("cv",folds = 3),
measure = msr( 'surv.cindex'),
store_models = FALSE,
store_benchmark_result = FALSE,
store_fselect_instance = TRUE,
terminator = trm("none"),
fselector = fs("exhaustive_search", max_features = 1)
)
## sequential
future::plan("sequential")
tic()
rr = resample(task, afs, resampling)
toc()
# 1.5 seconds
## multisession
future::plan(list(future::tweak("multisession", workers = 2), "sequential"))
tic()
rr = resample(task, afs, resampling)
toc()
# 4.4 seconds
AutoFSelector with more outer iterations.
resampling = rsmp("repeated_cv", folds = 3, repeats = 10)
afs = AutoFSelector$new(
learner = lrn("surv.coxph"),
resampling = rsmp("cv",folds = 3),
measure = msr( 'surv.cindex'),
store_models = FALSE,
store_benchmark_result = FALSE,
store_fselect_instance = TRUE,
terminator = trm("none"),
fselector = fs("exhaustive_search", max_features = 1)
)
## sequential
future::plan("sequential")
tic()
rr = resample(task, afs, resampling)
toc()
# 26 seconds
## multisession
future::plan(list(future::tweak("multisession", workers = 2), "sequential"))
tic()
rr = resample(task, afs, resampling)
toc()
# 14 seconds
AutoFSelector with subsampling.
resampling = rsmp("subsampling", repeats = 16, ratio = 0.7)
afs = AutoFSelector$new(
learner = lrn("surv.coxph"),
resampling = rsmp("cv",folds = 3),
measure = msr( 'surv.cindex'),
store_models = FALSE,
store_benchmark_result = FALSE,
store_fselect_instance = TRUE,
terminator = trm("none"),
fselector = fs("exhaustive_search", max_features = 1)
)
## sequential
future::plan("sequential")
tic()
rr = resample(task, afs, resampling)
toc()
# 13 seconds
## multisession
future::plan(list(future::tweak("multisession", workers = 2), "sequential"))
tic()
rr = resample(task, afs, resampling)
toc()
# 8 seconds
@lenaeb I cannot reproduce the issue. Can you run the following test on your machine and confirm?
library(mlr3proba)
library(mlr3fselect)
library(tictoc)
task = TaskSurv$new(survival::rats, id = "right_censored", time = "time", event = "status", type = "right")
task$select(setdiff(task$feature_names, "sex"))
afs = AutoFSelector$new(
learner = lrn("surv.coxph"),
resampling = rsmp("cv", folds = 3),
measure = msr("surv.cindex"),
store_models = FALSE,
store_benchmark_result = FALSE,
store_fselect_instance = TRUE,
terminator = trm("none"),
fselector = fs("exhaustive_search", max_features = 1)
)
# outer loop
resampling = rsmp("subsampling", repeats = 16, ratio = 0.7)
resampling$instantiate(task)
## benchmark grid
design = benchmark_grid(
learners = afs,
tasks = task,
resamplings = resampling
)
## sequential
future::plan("sequential")
tic()
bmr = benchmark(design,store_models = TRUE, store_backends = FALSE)
toc()
# 15 seconds
## multisession
future::plan(list(future::tweak("multisession", workers = 2), "sequential"))
tic()
bmr = benchmark(design, store_models = TRUE, store_backends = FALSE)
toc()
# 8 seconds
The problem is still remaining. Here you can find the output when we run the script on the cluster:
library(mlr3proba)
#> Loading required package: mlr3
library(mlr3fselect)
library(tictoc)
task = TaskSurv$new(survival::rats, id = "right_censored", time = "time", event = "status", type = "right")
task$select(setdiff(task$feature_names, "sex"))
afs = AutoFSelector$new(
learner = lrn("surv.coxph"),
resampling = rsmp("cv", folds = 3),
measure = msr("surv.cindex"),
store_models = FALSE,
store_benchmark_result = FALSE,
store_fselect_instance = TRUE,
terminator = trm("none"),
fselector = fs("exhaustive_search", max_features = 1)
)
# outer loop
resampling = rsmp("subsampling", repeats = 16, ratio = 0.7)
resampling$instantiate(task)
## benchmark grid
design = benchmark_grid(
learners = afs,
tasks = task,
resamplings = resampling
)
lgr::get_logger("bbotk")$set_threshold("warn")
lgr::get_logger("mlr3")$set_threshold("warn")
## sequential
future::plan("sequential")
tic()
bmr = benchmark(design,store_models = TRUE, store_backends = FALSE)
toc()
#> 14.753 sec elapsed
## multisession
future::plan(list(future::tweak("multisession", workers = 2), "sequential"))
tic()
bmr = benchmark(design, store_models = TRUE, store_backends = FALSE)
toc()
#> 142.167 sec elapsed
Created on 2022-06-01 by the reprex package (v2.0.0)
Here on the local machine, the runtime is more or less the same...
library(mlr3proba)
#> Loading required package: mlr3
library(mlr3fselect)
library(tictoc)
task = TaskSurv$new(survival::rats, id = "right_censored", time = "time", event = "status", type = "right")
task$select(setdiff(task$feature_names, "sex"))
afs = AutoFSelector$new(
learner = lrn("surv.coxph"),
resampling = rsmp("cv", folds = 3),
measure = msr("surv.cindex"),
store_models = FALSE,
store_benchmark_result = FALSE,
store_fselect_instance = TRUE,
terminator = trm("none"),
fselector = fs("exhaustive_search", max_features = 1)
)
# outer loop
resampling = rsmp("subsampling", repeats = 16, ratio = 0.7)
resampling$instantiate(task)
## benchmark grid
design = benchmark_grid(
learners = afs,
tasks = task,
resamplings = resampling
)
lgr::get_logger("bbotk")$set_threshold("warn")
lgr::get_logger("mlr3")$set_threshold("warn")
## sequential
future::plan("sequential")
tic()
bmr = benchmark(design,store_models = TRUE, store_backends = FALSE)
toc()
#> 17.09 sec elapsed
## multisession
future::plan(list(future::tweak("multisession", workers = 2), "sequential"))
tic()
bmr = benchmark(design, store_models = TRUE, store_backends = FALSE)
toc()
#> 14.58 sec elapsed
Created on 2022-06-01 by the reprex package (v2.0.1)
More or less the same is okay. fs("exhaustive_search", max_features = 1) produces only 2 combinations and therefore finishes very quickly. Parallelization introduces too much overhead in this case.
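As a back-of-the-envelope check (not from the original thread), you can count those candidate feature sets yourself; after dropping sex, rats only has the features litter and rx:
# feature subsets that exhaustive_search with max_features = 1 must evaluate:
# all non-empty subsets with at most one feature
features = setdiff(names(survival::rats), c("time", "status", "sex"))
features                          # "litter" "rx"
sum(choose(length(features), 1))  # 2 candidate feature sets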
It gets complicated when it only happens on the cluster. Do you have a dedicated machine there? And do you run the script with future::plan(list(future::tweak("multisession", workers = 2), "sequential")) on the cluster? @RaphaelS1 Can you run my last example on your local machine?
Yes, we have a dedicated machine there. Yes, we use this code snippet to run it on the cluster.
Please try to update all packages and run the script again. Post your sessionInfo(). One package might be old and produce large objects which are slowly serialized.
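One rough way to spot such bloat (just a sketch, not part of the original reply; it assumes the afs and task objects from the test script above) is to look at the serialized size of the objects that get shipped to the workers:
# serialized size in bytes of the objects passed to the workers;
# several MB here would point at a package creating unexpectedly large objects
length(serialize(afs, connection = NULL))
length(serialize(task, connection = NULL))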
Only for mlr3learners (--> 0.5.3) there was an update available; unfortunately it does not help.
So I updated the mlr3, mlr3fselect, and mlr3proba packages to the dev version via remotes::install_github("mlr-org/**").
Here is the output, still the same...
library(mlr3proba)
#> Loading required package: mlr3
library(mlr3fselect)
library(tictoc)
task = TaskSurv$new(survival::rats, id = "right_censored", time = "time", event = "status", type = "right")
task$select(setdiff(task$feature_names, "sex"))
afs = AutoFSelector$new(
learner = lrn("surv.coxph"),
resampling = rsmp("cv", folds = 3),
measure = msr("surv.cindex"),
store_models = FALSE,
store_benchmark_result = FALSE,
store_fselect_instance = TRUE,
terminator = trm("none"),
fselector = fs("exhaustive_search", max_features = 1)
)
# outer loop
resampling = rsmp("subsampling", repeats = 16, ratio = 0.7)
resampling$instantiate(task)
## benchmark grid
design = benchmark_grid(
learners = afs,
tasks = task,
resamplings = resampling
)
lgr::get_logger("bbotk")$set_threshold("warn")
lgr::get_logger("mlr3")$set_threshold("warn")
## sequential
future::plan("sequential")
tic()
bmr = benchmark(design,store_models = TRUE, store_backends = FALSE)
toc()
#> 14.652 sec elapsed
## multisession
future::plan(list(future::tweak("multisession", workers = 2), "sequential"))
tic()
bmr = benchmark(design, store_models = TRUE, store_backends = FALSE)
toc()
#> 133.025 sec elapsed
sessionInfo()
#> R version 4.0.5 (2021-03-31)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Red Hat Enterprise Linux 8.2 (Ootpa)
#>
#> Matrix products: default
#> BLAS/LAPACK: /apps/rocs/2020.08/cascadelake/software/OpenBLAS/0.3.9-GCC-9.3.0/lib/libopenblas_skylakexp-r0.3.9.so
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] tictoc_1.0.1 mlr3fselect_0.7.1 mlr3proba_0.4.12 mlr3_0.13.3-9000
#>
#> loaded via a namespace (and not attached):
#> [1] Rcpp_1.0.8.3 compiler_4.0.5 highr_0.9
#> [4] mlr3misc_0.10.0 tools_4.0.5 digest_0.6.29
#> [7] uuid_1.1-0 lattice_0.20-44 evaluate_0.15
#> [10] checkmate_2.1.0 dictionar6_0.1.3 rlang_1.0.2
#> [13] Matrix_1.3-3 reprex_2.0.0 cli_3.3.0
#> [16] rstudioapi_0.13 bbotk_0.5.3 yaml_2.2.1
#> [19] parallel_4.0.5 xfun_0.22 withr_2.5.0
#> [22] stringr_1.4.0 knitr_1.33 fs_1.5.0
#> [25] globals_0.15.0 grid_4.0.5 glue_1.6.2
#> [28] data.table_1.14.2 listenv_0.8.0 param6_0.2.4
#> [31] distr6_1.6.10 R6_2.5.1 future.apply_1.9.0
#> [34] parallelly_1.31.1 survival_3.2-11 rmarkdown_2.11
#> [37] lgr_0.4.3 magrittr_2.0.3 mlr3pipelines_0.4.1
#> [40] splines_4.0.5 backports_1.4.1 codetools_0.2-18
#> [43] htmltools_0.5.1.1 palmerpenguins_0.1.0 future_1.26.1
#> [46] set6_0.2.4 paradox_0.9.0 ooplah_0.2.0
#> [49] stringi_1.6.1 crayon_1.5.1
Created on 2022-06-07 by the reprex package (v2.0.0)
Hi @lenaeb, this seems more like a technical issue rather than a problem with mlr3proba. Out of curiosity, did you maybe get it to work the way you wanted in the end? Stack Overflow might be a better place to address this sort of question :)
Closed due to inactivity