
Parallelization issues with coxph

Open lenaeb opened this issue 2 years ago • 18 comments

Hi all,

we want to run a nested cross-validation with surv.coxph, and we want to parallelize the outer loop (using future).

Unfortunately the parallelization does not seem to work: adding workers does not reduce the run time. E.g. for workers = 4 we need 4.424515 mins and for 2 workers we need 2.448493 mins. Using nested cross-validation for a classification task, we could nicely see the run time decrease as the number of workers increased.

Below you can find an example:

library(mlr3)
library(mlr3fselect)
library(mlr3learners)
library(dplyr)
library(mlr3proba)
task = TaskSurv$new(survival::rats %>% select(-sex), id = "right_censored",
                    time = "time", event = "status", type = "right")

# Learner
learner_auto <- AutoFSelector$new(
  learner = lrn("surv.coxph"),
  resampling = rsmp("cv",folds = 3),
  measure = msr( 'surv.cindex'),
  store_models = F, store_benchmark_result = F, store_fselect_instance = T,
  terminator = trm("evals", n_evals = 100000),
  fselector = fs("exhaustive_search", max_features = 1)
)

# outer loop
outer_rsmp <- rsmp("subsampling", repeats = 16, ratio = 0.7)
outer_rsmp$instantiate(task)

## benchmark grid
design <- benchmark_grid(
  learners = learner_auto,
  tasks = task,
  resamplings = outer_rsmp
)

# parallelization
future::plan(list(future::tweak("multisession", workers = 4), 'sequential') )

start = Sys.time()
bench_res <- benchmark(design,store_models = T, store_backends = F) 
end = Sys.time()
print(end-start)


# parallelization
future::plan(list(future::tweak("multisession", workers = 2), 'sequential') )

start = Sys.time()
bench_res <- benchmark(design,store_models = T, store_backends = F) 
end = Sys.time()
print(end-start)

Best regards, Lena

lenaeb avatar May 31 '22 12:05 lenaeb

Hey, sorry, I can't get your example working. Can you please use reprex::reprex?

RaphaelS1 avatar May 31 '22 14:05 RaphaelS1

Hi Raphael,

sorry about that. I hope it works now, if not, please let me know!

Best, Lena

library(mlr3)
library(mlr3fselect)
library(mlr3learners)
library(dplyr)
library(mlr3proba)
task = TaskSurv$new(survival::rats %>% select(-sex), id = "right_censored",
                    time = "time", event = "status", type = "right")

# Learner
learner_auto <- AutoFSelector$new(
  learner = lrn("surv.coxph"),
  resampling = rsmp("cv",folds = 3),
  measure = msr( 'surv.cindex'),
  store_models = F, store_benchmark_result = F, store_fselect_instance = T,
  terminator = trm("evals", n_evals = 100000),
  fselector = fs("exhaustive_search", max_features = 1)
)

# outer loop
outer_rsmp <- rsmp("subsampling", repeats = 16, ratio = 0.7)
outer_rsmp$instantiate(task)

## benchmark grid
design <- benchmark_grid(
  learners = learner_auto,
  tasks = task,
  resamplings = outer_rsmp
)

# parallelization
future::plan(list(future::tweak("multisession", workers = 4), 'sequential') )

start = Sys.time()
bench_res <- benchmark(design,store_models = T, store_backends = F) 
end = Sys.time()
print(end-start)
#> Time difference of 2.697443 mins


# parallelization
future::plan(list(future::tweak("multisession", workers = 2), 'sequential') )

start = Sys.time()
bench_res <- benchmark(design,store_models = T, store_backends = F) 
end = Sys.time()
print(end-start)
#> Time difference of 1.315648 mins

Created on 2022-06-01 by the reprex package (v2.0.0)

lenaeb avatar Jun 01 '22 06:06 lenaeb

Thanks, I removed the uninformative INFO. To me it looks strange because workers = 2 actually takes half the time of workers = 4. I haven't done any work with parallelisation in mlr3proba TBH, so maybe another member of the team who works more with future can help. Maybe @be-marc?

RaphaelS1 avatar Jun 01 '22 08:06 RaphaelS1

Doubling the time with more cores is very strange. Two ideas:

  1. Do you run into memory issues with 4 workers? Using the swap would slow down the calculations.
  2. How fast is the sequential execution? Future needs to serialize and deserialize the objects which are passed between the main process and the workers. When serialization is a major overhead, 4 workers might be slower than 2.
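
The second idea can be checked directly. Below is a minimal sketch using only base R: it times `serialize()` on a stand-in object and reports how many bytes would have to cross the process boundary. The `lm` fit on `mtcars` is an assumption made for illustration; it is not the learner from this issue, so the numbers only show the method.

```r
# Stand-in object: any R object that would be shipped to a worker.
# (Assumption: a small lm fit replaces the real learner for illustration.)
fit <- lm(mpg ~ wt + hp, data = mtcars)

# serialize() with connection = NULL returns a raw vector; its length is
# the number of bytes that would be sent to or from a worker.
timing <- system.time(payload <- serialize(fit, connection = NULL))
payload_bytes <- length(payload)

# Round trip to confirm the object survives serialization.
restored <- unserialize(payload)

cat("serialized size (bytes):", payload_bytes, "\n")
cat("elapsed (s):", timing[["elapsed"]], "\n")
```

Running the same two measurements on the object that is actually passed to the workers would show whether serialization is large enough to matter.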

be-marc avatar Jun 01 '22 08:06 be-marc

I think it's 2). You can see below that without parallelization it takes 12 sec, with 2 workers 2 min, with 4 workers 3 min, and with 8 workers 6 min.

library(mlr3)
library(mlr3fselect)
library(mlr3learners)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(mlr3proba)
task = TaskSurv$new(survival::rats %>% select(-sex), id = "right_censored",
                    time = "time", event = "status", type = "right")

# Learner
learner_auto <- AutoFSelector$new(
  learner = lrn("surv.coxph"),
  resampling = rsmp("cv",folds = 3),
  measure = msr( 'surv.cindex'),
  store_models = F, store_benchmark_result = F, store_fselect_instance = T,
  terminator = trm("evals", n_evals = 100000),
  fselector = fs("exhaustive_search", max_features = 1)
)

# outer loop
outer_rsmp <- rsmp("subsampling", repeats = 16, ratio = 0.7)
outer_rsmp$instantiate(task)

## benchmark grid
design <- benchmark_grid(
  learners = learner_auto,
  tasks = task,
  resamplings = outer_rsmp
)

lgr::get_logger("bbotk")$set_threshold("warn")
lgr::get_logger("mlr3")$set_threshold("warn")

start = Sys.time()
bench_res <- benchmark(design,store_models = T, store_backends = F) 
end = Sys.time()
print(end-start)
#> Time difference of 12.27735 secs


# parallelization
future::plan(list(future::tweak("multisession", workers = 2), 'sequential') )

start = Sys.time()
bench_res <- benchmark(design,store_models = T, store_backends = F) 
end = Sys.time()
print(end-start)
#> Time difference of 2.162249 mins

# parallelization
future::plan(list(future::tweak("multisession", workers = 4), 'sequential') )

start = Sys.time()
bench_res <- benchmark(design,store_models = T, store_backends = F) 
end = Sys.time()
print(end-start)
#> Time difference of 3.286482 mins


# parallelization
future::plan(list(future::tweak("multisession", workers = 8), 'sequential') )

start = Sys.time()
bench_res <- benchmark(design,store_models = T, store_backends = F) 
end = Sys.time()
print(end-start)
#> Time difference of 6.048725 mins

Created on 2022-06-01 by the reprex package (v2.0.0)
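
To put the four timings above on one scale, here is a small sketch that converts them to seconds and divides by the sequential run. The values are copied from the output above; the variable names are made up for this illustration.

```r
# Wall-clock times reported above, converted to seconds.
elapsed_sec <- c(
  sequential = 12.27735,
  workers_2  = 2.162249 * 60,
  workers_4  = 3.286482 * 60,
  workers_8  = 6.048725 * 60
)

# Ratio of each run to the sequential baseline; values above 1 mean slower.
slowdown <- elapsed_sec / elapsed_sec[["sequential"]]
print(round(slowdown, 1))
```

Every parallel run is more than ten times slower than the sequential one, and the ratio grows with the number of workers.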

lenaeb avatar Jun 01 '22 08:06 lenaeb

Thanks for testing. I will investigate this further.

be-marc avatar Jun 01 '22 08:06 be-marc

I will post a few tests.

Learner serialization.

library(mlr3proba)
library(tictoc)

task = TaskSurv$new(survival::rats, id = "right_censored", time = "time", event = "status", type = "right")
task$select(setdiff(task$feature_names, "sex"))

learner = lrn("surv.coxph")
learner$train(task)

# serialize
tic()
x = serialize(learner, connection = NULL)
toc()
# 0.011 seconds

be-marc avatar Jun 01 '22 09:06 be-marc

AutoFSelector with random search and full batch.

afs = AutoFSelector$new(
  learner = lrn("surv.coxph"),
  resampling = rsmp("cv",folds = 3),
  measure = msr( 'surv.cindex'),
  store_models = FALSE, 
  store_benchmark_result = FALSE,
  store_fselect_instance = TRUE,
  terminator = trm("evals", n_evals = 100),
  fselector = fs("random_search", batch_size = 100)
)

## sequential
future::plan("sequential")

tic()
afs$train(task)
toc()
# 35 seconds

## multisession
future::plan("multisession", workers = 4)

tic()
afs$train(task)
toc()
# 20 seconds

be-marc avatar Jun 01 '22 09:06 be-marc

Nested resampling.

resampling = rsmp("cv", folds = 3)

## sequential
future::plan("sequential")

tic()
rr = resample(task, afs, resampling)
toc()
# 112 seconds

## multisession
future::plan(list(future::tweak("multisession", workers = 2), "sequential"))

tic()
rr = resample(task, afs, resampling)
toc()
# 64 seconds

be-marc avatar Jun 01 '22 09:06 be-marc

AutoFSelector with exhaustive search.

afs = AutoFSelector$new(
  learner = lrn("surv.coxph"),
  resampling = rsmp("cv",folds = 3),
  measure = msr( 'surv.cindex'),
  store_models = FALSE, 
  store_benchmark_result = FALSE,
  store_fselect_instance = TRUE,
  terminator = trm("none"),
  fselector = fs("exhaustive_search", max_features = 1)
)

## sequential
future::plan("sequential")

tic()
rr = resample(task, afs, resampling)
toc()
# 1.5 seconds

## multisession
future::plan(list(future::tweak("multisession", workers = 2), "sequential"))

tic()
rr = resample(task, afs, resampling)
toc()
# 4.4 seconds

be-marc avatar Jun 01 '22 09:06 be-marc

AutoFSelector with more outer iterations

resampling = rsmp("repeated_cv", folds = 3, repeats = 10)

afs = AutoFSelector$new(
  learner = lrn("surv.coxph"),
  resampling = rsmp("cv",folds = 3),
  measure = msr( 'surv.cindex'),
  store_models = FALSE, 
  store_benchmark_result = FALSE,
  store_fselect_instance = TRUE,
  terminator = trm("none"),
  fselector = fs("exhaustive_search", max_features = 1)
)

## sequential
future::plan("sequential")

tic()
rr = resample(task, afs, resampling)
toc()
# 26 seconds

## multisession
future::plan(list(future::tweak("multisession", workers = 2), "sequential"))

tic()
rr = resample(task, afs, resampling)
toc()
# 14 seconds

be-marc avatar Jun 01 '22 09:06 be-marc

AutoFSelector with subsampling.

resampling = rsmp("subsampling", repeats = 16, ratio = 0.7)

afs = AutoFSelector$new(
  learner = lrn("surv.coxph"),
  resampling = rsmp("cv",folds = 3),
  measure = msr( 'surv.cindex'),
  store_models = FALSE, 
  store_benchmark_result = FALSE,
  store_fselect_instance = TRUE,
  terminator = trm("none"),
  fselector = fs("exhaustive_search", max_features = 1)
)

## sequential
future::plan("sequential")

tic()
rr = resample(task, afs, resampling)
toc()
# 13 seconds

## multisession
future::plan(list(future::tweak("multisession", workers = 2), "sequential"))

tic()
rr = resample(task, afs, resampling)
toc()
# 8 seconds

be-marc avatar Jun 01 '22 09:06 be-marc

@lenaeb I cannot reproduce the issue. Can you run the following test on your machine and confirm?

library(mlr3proba)
library(mlr3fselect)
library(tictoc)

task = TaskSurv$new(survival::rats, id = "right_censored", time = "time", event = "status", type = "right")
task$select(setdiff(task$feature_names, "sex"))

afs = AutoFSelector$new(
  learner = lrn("surv.coxph"),
  resampling = rsmp("cv", folds = 3),
  measure = msr("surv.cindex"),
  store_models = FALSE, 
  store_benchmark_result = FALSE,
  store_fselect_instance = TRUE,
  terminator = trm("none"),
  fselector = fs("exhaustive_search", max_features = 1)
)

# outer loop
resampling = rsmp("subsampling", repeats = 16, ratio = 0.7)
resampling$instantiate(task)

## benchmark grid
design = benchmark_grid(
  learners = afs,
  tasks = task,
  resamplings = resampling
)

## sequential
future::plan("sequential")

tic()
bmr = benchmark(design,store_models = TRUE, store_backends = FALSE) 
toc()
# 15 seconds

## multisession
future::plan(list(future::tweak("multisession", workers = 2), "sequential"))

tic()
bmr = benchmark(design, store_models = TRUE, store_backends = FALSE) 
toc()
# 8 seconds

be-marc avatar Jun 01 '22 09:06 be-marc

The problem still remains. Here is the output when we run the script on the cluster:

library(mlr3proba)
#> Loading required package: mlr3
library(mlr3fselect)
library(tictoc)

task = TaskSurv$new(survival::rats, id = "right_censored", time = "time", event = "status", type = "right")
task$select(setdiff(task$feature_names, "sex"))

afs = AutoFSelector$new(
  learner = lrn("surv.coxph"),
  resampling = rsmp("cv", folds = 3),
  measure = msr("surv.cindex"),
  store_models = FALSE, 
  store_benchmark_result = FALSE,
  store_fselect_instance = TRUE,
  terminator = trm("none"),
  fselector = fs("exhaustive_search", max_features = 1)
)

# outer loop
resampling = rsmp("subsampling", repeats = 16, ratio = 0.7)
resampling$instantiate(task)

## benchmark grid
design = benchmark_grid(
  learners = afs,
  tasks = task,
  resamplings = resampling
)

lgr::get_logger("bbotk")$set_threshold("warn")
lgr::get_logger("mlr3")$set_threshold("warn")

## sequential
future::plan("sequential")

tic()
bmr = benchmark(design,store_models = TRUE, store_backends = FALSE) 
toc()
#> 14.753 sec elapsed

## multisession
future::plan(list(future::tweak("multisession", workers = 2), "sequential"))

tic()
bmr = benchmark(design, store_models = TRUE, store_backends = FALSE) 
toc()
#> 142.167 sec elapsed

Created on 2022-06-01 by the reprex package (v2.0.0)

Here on the local machine, the runtime is more or less the same...

library(mlr3proba)
#> Loading required package: mlr3
library(mlr3fselect)
library(tictoc)

task = TaskSurv$new(survival::rats, id = "right_censored", time = "time", event = "status", type = "right")
task$select(setdiff(task$feature_names, "sex"))

afs = AutoFSelector$new(
  learner = lrn("surv.coxph"),
  resampling = rsmp("cv", folds = 3),
  measure = msr("surv.cindex"),
  store_models = FALSE, 
  store_benchmark_result = FALSE,
  store_fselect_instance = TRUE,
  terminator = trm("none"),
  fselector = fs("exhaustive_search", max_features = 1)
)

# outer loop
resampling = rsmp("subsampling", repeats = 16, ratio = 0.7)
resampling$instantiate(task)

## benchmark grid
design = benchmark_grid(
  learners = afs,
  tasks = task,
  resamplings = resampling
)

lgr::get_logger("bbotk")$set_threshold("warn")
lgr::get_logger("mlr3")$set_threshold("warn")

## sequential
future::plan("sequential")

tic()
bmr = benchmark(design,store_models = TRUE, store_backends = FALSE) 
toc()
#> 17.09 sec elapsed

## multisession
future::plan(list(future::tweak("multisession", workers = 2), "sequential"))

tic()
bmr = benchmark(design, store_models = TRUE, store_backends = FALSE) 
toc()
#> 14.58 sec elapsed

Created on 2022-06-01 by the reprex package (v2.0.1)

lenaeb avatar Jun 01 '22 12:06 lenaeb

More or less the same is okay. fs("exhaustive_search", max_features = 1) produces only 2 combinations and therefore finishes very quickly. Parallelization introduces too much overhead in this case.
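
As a rough illustration of why the search space is so small, the sketch below enumerates all feature subsets of size at most `max_features` with base R. It is not how mlr3fselect implements the search, and the feature names are an assumption (the columns left in `survival::rats` after dropping `sex`, `time` and `status`).

```r
# Assumption: with `sex` dropped, the rats task has these two features.
features <- c("litter", "rx")
max_features <- 1

# All non-empty subsets with at most `max_features` features.
subsets <- unlist(
  lapply(seq_len(min(max_features, length(features))), function(k) {
    combn(features, k, simplify = FALSE)
  }),
  recursive = FALSE
)

# Number of candidate feature sets the inner loop has to evaluate.
length(subsets)
```

With so few candidates per outer iteration, each worker has very little to compute relative to the fixed cost of starting up and exchanging objects.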

It gets complicated when it only happens on the cluster. Do you have a dedicated machine there? And do you run the script with future::plan(list(future::tweak("multisession", workers = 2), "sequential")) on the cluster? @RaphaelS1 Can you run my last example on your local machine?
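
For the question about the machine, a base-R sketch like the following could be run inside the cluster job to see what resources R reports. It is only a diagnostic under assumptions: the environment variable names are common conventions for threaded numeric libraries, and nothing in this thread establishes that they are set or relevant on this cluster.

```r
# Cores the operating system reports; on a shared cluster node this can be
# more than the job is actually allowed to use.
detected <- parallel::detectCores()

# Thread-related environment variables that a multi-threaded BLAS or
# OpenMP runtime may honor; an empty string means the variable is unset.
thread_vars <- Sys.getenv(
  c("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS")
)

print(detected)
print(thread_vars)
```

Comparing this output between the cluster and the local machine is one way to spot a difference in the environment before looking further into the packages.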

be-marc avatar Jun 01 '22 12:06 be-marc

Yes, we have a dedicated machine there. And yes, we use this code snippet to run it on the cluster.

lenaeb avatar Jun 01 '22 13:06 lenaeb

Please try to update all packages and run the script again. Post your sessionInfo(). One package might be old and produce large objects which are slow to serialize.

be-marc avatar Jun 07 '22 08:06 be-marc

An update was available only for mlr3learners ( --> 0.5.3); unfortunately it does not help.

So I updated the mlr3, mlr3fselect, and mlr3proba packages to the dev version via remotes::install_github("mlr-org/**")

Here is the output, still the same:

library(mlr3proba)
#> Loading required package: mlr3
library(mlr3fselect)
library(tictoc)

task = TaskSurv$new(survival::rats, id = "right_censored", time = "time", event = "status", type = "right")
task$select(setdiff(task$feature_names, "sex"))

afs = AutoFSelector$new(
  learner = lrn("surv.coxph"),
  resampling = rsmp("cv", folds = 3),
  measure = msr("surv.cindex"),
  store_models = FALSE, 
  store_benchmark_result = FALSE,
  store_fselect_instance = TRUE,
  terminator = trm("none"),
  fselector = fs("exhaustive_search", max_features = 1)
)

# outer loop
resampling = rsmp("subsampling", repeats = 16, ratio = 0.7)
resampling$instantiate(task)

## benchmark grid
design = benchmark_grid(
  learners = afs,
  tasks = task,
  resamplings = resampling
)

lgr::get_logger("bbotk")$set_threshold("warn")
lgr::get_logger("mlr3")$set_threshold("warn")

## sequential
future::plan("sequential")

tic()
bmr = benchmark(design,store_models = TRUE, store_backends = FALSE) 
toc()
#> 14.652 sec elapsed

## multisession
future::plan(list(future::tweak("multisession", workers = 2), "sequential"))

tic()
bmr = benchmark(design, store_models = TRUE, store_backends = FALSE) 
toc()
#> 133.025 sec elapsed

sessionInfo()
#> R version 4.0.5 (2021-03-31)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Red Hat Enterprise Linux 8.2 (Ootpa)
#> 
#> Matrix products: default
#> BLAS/LAPACK: /apps/rocs/2020.08/cascadelake/software/OpenBLAS/0.3.9-GCC-9.3.0/lib/libopenblas_skylakexp-r0.3.9.so
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] tictoc_1.0.1      mlr3fselect_0.7.1 mlr3proba_0.4.12  mlr3_0.13.3-9000 
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_1.0.8.3         compiler_4.0.5       highr_0.9           
#>  [4] mlr3misc_0.10.0      tools_4.0.5          digest_0.6.29       
#>  [7] uuid_1.1-0           lattice_0.20-44      evaluate_0.15       
#> [10] checkmate_2.1.0      dictionar6_0.1.3     rlang_1.0.2         
#> [13] Matrix_1.3-3         reprex_2.0.0         cli_3.3.0           
#> [16] rstudioapi_0.13      bbotk_0.5.3          yaml_2.2.1          
#> [19] parallel_4.0.5       xfun_0.22            withr_2.5.0         
#> [22] stringr_1.4.0        knitr_1.33           fs_1.5.0            
#> [25] globals_0.15.0       grid_4.0.5           glue_1.6.2          
#> [28] data.table_1.14.2    listenv_0.8.0        param6_0.2.4        
#> [31] distr6_1.6.10        R6_2.5.1             future.apply_1.9.0  
#> [34] parallelly_1.31.1    survival_3.2-11      rmarkdown_2.11      
#> [37] lgr_0.4.3            magrittr_2.0.3       mlr3pipelines_0.4.1 
#> [40] splines_4.0.5        backports_1.4.1      codetools_0.2-18    
#> [43] htmltools_0.5.1.1    palmerpenguins_0.1.0 future_1.26.1       
#> [46] set6_0.2.4           paradox_0.9.0        ooplah_0.2.0        
#> [49] stringi_1.6.1        crayon_1.5.1

Created on 2022-06-07 by the reprex package (v2.0.0)

lenaeb avatar Jun 07 '22 10:06 lenaeb

Hi @lenaeb, this seems more like a technical issue than a problem with mlr3proba. Out of curiosity, did you get it to work the way you wanted in the end? Stack Overflow might be a better place for this sort of question :)

bblodfon avatar Dec 04 '23 09:12 bblodfon

Closed due to inactivity

bblodfon avatar Dec 18 '23 17:12 bblodfon