resample() does not set data_prototype (and task_prototype), which some learners rely on
Hi, I'm using mlr3 on a Kaggle kernel and ran into a problem with the resample() function. The error message mentions some issues with data.table column selection and future.apply.
As a workaround, I'm currently able to use mlr3 v0.16.1 with the latest release of mlr3extralearners, but only by forcing data.table and future.apply not to upgrade by default (both are dependencies of the two packages).
Reproducible code:
# Install packages
install.packages("skimr")
install.packages("Cubist")
install.packages("mlr3verse")
remotes::install_github("mlr-org/mlr3extralearners@*release")
> Installing package into ‘/usr/local/lib/R/site-library’
> (as ‘lib’ is unspecified)
>
> Installing package into ‘/usr/local/lib/R/site-library’
> (as ‘lib’ is unspecified)
>
> Installing package into ‘/usr/local/lib/R/site-library’
> (as ‘lib’ is unspecified)
>
> Downloading GitHub repo mlr-org/[email protected]
>
> data.table (1.14.8 -> 1.14.10) [CRAN]
> future (1.33.0 -> 1.33.1 ) [CRAN]
> future.apply (1.11.0 -> 1.11.1 ) [CRAN]
> mlr3 (0.17.0 -> 0.17.1 ) [CRAN]
> Installing 4 packages: data.table, future, future.apply, mlr3
# Modeling
library("mlr3")
task = tsk("boston_housing")
task$select(c("age", "b", "chas"))
learner = lrn("regr.randomForest", importance = "mse")
learner$train(task)
cv.results <- resample(task, learner, rsmp("cv", folds = 10))
> INFO [15:56:18.968] [mlr3] Applying learner 'regr.randomForest' on task 'boston_housing' (iter 1/10)
> INFO [15:56:19.261] [mlr3] Applying learner 'regr.randomForest' on task 'boston_housing' (iter 2/10)
> INFO [15:56:19.501] [mlr3] Applying learner 'regr.randomForest' on task 'boston_housing' (iter 3/10)
> INFO [15:56:20.041] [mlr3] Applying learner 'regr.randomForest' on task 'boston_housing' (iter 4/10)
> INFO [15:56:20.261] [mlr3] Applying learner 'regr.randomForest' on task 'boston_housing' (iter 5/10)
> INFO [15:56:20.778] [mlr3] Applying learner 'regr.randomForest' on task 'boston_housing' (iter 6/10)
> INFO [15:56:20.985] [mlr3] Applying learner 'regr.randomForest' on task 'boston_housing' (iter 7/10)
> INFO [15:56:21.201] [mlr3] Applying learner 'regr.randomForest' on task 'boston_housing' (iter 8/10)
> INFO [15:56:21.427] [mlr3] Applying learner 'regr.randomForest' on task 'boston_housing' (iter 9/10)
> INFO [15:56:21.643] [mlr3] Applying learner 'regr.randomForest' on task 'boston_housing' (iter 10/10)
> Error in eval(predvars, data, env): object 'age' not found
> Traceback:
>
> 1. resample(task, learner, rsmp("cv", folds = 10))
> 2. future_map(n, workhorse, iteration = seq_len(n), learner = grid$learner,
> . mode = grid$mode, MoreArgs = list(task = task, resampling = resampling,
> . store_models = store_models, lgr_threshold = lgr_threshold,
> . pb = pb))
> 3. future.apply::future_mapply(FUN, ..., MoreArgs = MoreArgs, SIMPLIFY = FALSE,
> . USE.NAMES = FALSE, future.globals = FALSE, future.packages = "mlr3",
> . future.seed = TRUE, future.scheduling = scheduling, future.chunk.size = chunk_size,
> . future.stdout = stdout)
> 4. future_xapply(FUN = FUN, nX = nX, chunk_args = dots, MoreArgs = MoreArgs,
> . get_chunk = function(X, chunk) lapply(X, FUN = `chunkWith[[`,
> . chunk), expr = expr, envir = envir, future.envir = future.envir,
> . future.globals = future.globals, future.packages = future.packages,
> . future.scheduling = future.scheduling, future.chunk.size = future.chunk.size,
> . future.stdout = future.stdout, future.conditions = future.conditions,
> . future.seed = future.seed, future.label = future.label, fcn_name = fcn_name,
> . args_name = args_name, debug = debug)
> 5. value(fs)
> 6. value.list(fs)
> 7. resolve(y, result = TRUE, stdout = stdout, signal = signal, force = TRUE)
> 8. resolve.list(y, result = TRUE, stdout = stdout, signal = signal,
> . force = TRUE)
> 9. signalConditionsASAP(obj, resignal = FALSE, pos = ii)
> 10. signalConditions(obj, exclude = getOption("future.relay.immediate",
> . "immediateCondition"), resignal = resignal, ...)
Session info:
R version 4.0.5 (2021-03-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.6 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] ranger_0.14.1 Cubist_0.4.2.1 lattice_0.22-5
[4] mlr3extralearners_0.7.1 mlr3_0.16.1 data.table_1.14.8
[7] future.apply_1.11.0 future_1.33.0 skimr_2.1.5
[10] ggridges_0.5.4 lubridate_1.9.3 forcats_1.0.0
[13] stringr_1.5.1 dplyr_1.1.4 purrr_1.0.2
[16] readr_2.1.4 tidyr_1.3.0 tibble_3.2.1
[19] ggplot2_3.4.4 tidyverse_2.0.0 bigrquery_1.4.2
[22] httr_1.4.7
loaded via a namespace (and not attached):
[1] bit64_4.0.5 jsonlite_1.8.8 assertthat_0.2.1
[4] lgr_0.4.4 mlr3misc_0.13.0 remotes_2.4.2.1
[7] globals_0.16.2 pillar_1.9.0 backports_1.4.1
[10] glue_1.6.2 uuid_1.1-1 digest_0.6.33
[13] checkmate_2.3.1 colorspace_2.1-0 Matrix_1.6-4
[16] plyr_1.8.9 htmltools_0.5.7 pkgconfig_2.0.3
[19] listenv_0.9.0 scales_1.3.0 processx_3.8.2
[22] tzdb_0.4.0 timechange_0.2.0 generics_0.1.3
[25] withr_2.5.2 repr_1.1.6.9000 cli_3.6.1
[28] paradox_0.11.1 magrittr_2.0.3 crayon_1.5.2
[31] evaluate_0.23 ps_1.7.5 fs_1.6.3
[34] fansi_1.0.5 parallelly_1.36.0 pkgbuild_1.4.2
[37] palmerpenguins_0.1.1 tools_4.0.5 prettyunits_1.2.0
[40] hms_1.1.3 gargle_1.5.2 lifecycle_1.0.4
[43] munsell_0.5.0 callr_3.7.3 compiler_4.0.5
[46] rlang_1.1.2 grid_4.0.5 pbdZMQ_0.3-10
[49] IRkernel_1.3.2.9000 base64enc_0.1-3 gtable_0.3.4
[52] codetools_0.2-18 DBI_1.1.3 curl_5.1.0
[55] reshape2_1.4.4 R6_2.5.1 knitr_1.45
[58] fastmap_1.1.1 bit_4.0.5 utf8_1.2.4
[61] rprojroot_2.0.4 desc_1.4.2 stringi_1.8.2
[64] parallel_4.0.5 IRdisplay_1.1.0.9000 Rcpp_1.0.11
[67] vctrs_0.6.5 dbplyr_2.4.0 tidyselect_1.2.0
[70] xfun_0.41
Hey, sorry, I can't reproduce the issue. I created a clean environment with renv.
renv::init(bare = TRUE)
renv::install(c("[email protected]", "mlr-org/mlr3extralearners@*release", "randomForest"))
Your code runs without any problems.
task = tsk("boston_housing")
task$select(c("age", "b", "chas"))
learner = lrn("regr.randomForest", importance = "mse")
learner$train(task)
rr = resample(task, learner, rsmp("cv", folds = 10))
Session info.
R version 4.3.1 (2023-06-16)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 23.10
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.11.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.11.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8
[8] LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: Europe/Berlin
tzcode source: system (glibc)
attached base packages:
[1] stats graphics grDevices datasets utils methods base
other attached packages:
[1] mlr3extralearners_0.7.1 mlr3_0.17.1
loaded via a namespace (and not attached):
[1] digest_0.6.33 backports_1.4.1 R6_2.5.1 codetools_0.2-19 randomForest_4.7-1.1 lgr_0.4.4 parallel_4.3.1 RhpcBLASctl_0.23-42 palmerpenguins_0.1.1
[10] mlr3misc_0.13.0 parallelly_1.36.0 pak_0.7.1 future_1.33.1 renv_1.0.3 data.table_1.14.10 compiler_4.3.1 paradox_0.11.1 globals_0.16.2
My Kaggle kernel comes with R 4.0 and Ubuntu 20 installed by default. I'm not sure if I can change that. What do you recommend?
I can confirm that there is a bug on Kaggle. It is caused neither by the subsetting of the task nor by the task itself. The error does not occur with regr.rpart, but it does occur with regr.randomForest and regr.ranger. I cannot reproduce the bug on my local machine or in a rocker image with R 4.0.5. The error looks as if mlr3 is not passing data to the predict function of the upstream packages. Such an error would definitely have been noticed in our unit tests. That makes this quite tricky, since we can't easily debug on Kaggle.
https://www.kaggle.com/bemarc7832/issue-987
I believe the issue is this line in the randomForest learner:
https://github.com/mlr-org/mlr3extralearners/blob/5e291e0062347d24a263505e882dd9f409cb04ef/R/learner_randomForest_regr_randomForest.R#L113
This executes
task$data(cols = intersect(names(learner$state$data_prototype),
task$feature_names))
When I stop here, the learner's learner$state$data_prototype is NULL (this is the bug, see below). In modern R versions, the result of intersect() is also NULL, which leads to the call task$data(cols = NULL), and all columns are returned.
However, in older R versions, intersect(NULL, <character>) is not NULL but character(0). This leads to task$data(cols = character(0)) being called, so ordered_features() in the line linked above returns a 0-column data.table.
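To make the failure mode concrete, here is a minimal base-R sketch of the mechanism described above. It does not use mlr3 or randomForest: the toy data frame, the variable names and the lm() model are stand-ins chosen for illustration, on the assumption that an empty column selection reaches the upstream predict function in a similar way.

```r
# Toy data standing in for the task's backend (illustrative values only)
toy <- data.frame(medv = c(24, 21.6, 34.7, 33.4),
                  age  = c(65.2, 78.9, 61.1, 45.8))

# Stand-in for learner$state$data_prototype never having been set
prototype_names <- NULL
cols <- intersect(prototype_names, "age")

# Whether this is NULL or character(0) depends on the R version,
# so print it instead of assuming; the length is 0 either way
cat(R.version.string, "\n")
cat("is.null(cols):", is.null(cols), " length(cols):", length(cols), "\n")

# An empty column selection keeps the rows but drops every feature
empty <- toy[, character(0), drop = FALSE]
cat("dim(empty):", dim(empty), "\n")

# A formula-based model (lm() as a stand-in for the real learner)
# then cannot find its predictors in the new data
fit <- lm(medv ~ age, data = toy)
msg <- tryCatch(
  {
    predict(fit, newdata = empty)
    NA_character_
  },
  error = function(e) conditionMessage(e)
)
print(msg)
```

The captured message should match the "object 'age' not found" error from the traceback, which is consistent with the 0-column explanation. It does not prove that the randomForest learner takes exactly this path.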
Idk when this new behaviour of intersect() was introduced. It appears to be this diff, and this entry in the R 4.2.0 NEWS sounds like a match:
The set utility functions, notably intersect() have been tweaked to be more consistent and symmetric in their two set arguments, also preserving a common mode.
... although the timing does not seem to match. But it was somewhere between 4.1.2 and 4.2.0, I think. Too lazy to check.
Now to the bug in our code: I assume the problem is that, since this patch, data_prototype is no longer set during resampling. resample() does not call the learner's train(), so data_prototype is not set.
(It may currently be unnecessary to set data_prototype in resampling, since the task remains the same, but this may change with the new holdout task thing that may be introduced. Also, we should make sure other places handle data_prototype being NULL correctly.)
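One way to handle a NULL data_prototype independently of the R version could look like the sketch below. The helper name ordered_feature_names() is hypothetical and not part of mlr3 or mlr3extralearners. The fallback to the task's own feature names is an assumption about the desired behaviour, not the fix that was actually applied.

```r
# Hypothetical helper (not an mlr3 API): choose the columns to request
# from the task without relying on what intersect(NULL, x) returns
ordered_feature_names <- function(prototype_names, feature_names) {
  if (is.null(prototype_names)) {
    # No prototype recorded (e.g. the state was not set during resampling):
    # assume the task's current features are the right ones
    return(feature_names)
  }
  # Prototype available: keep its column order, restricted to the
  # features that the task still provides
  intersect(prototype_names, feature_names)
}

# Usage with the feature names from the example above
ordered_feature_names(NULL, c("age", "b", "chas"))
ordered_feature_names(c("chas", "age", "lstat"), c("age", "b", "chas"))
```

The explicit is.null() check makes the result the same on R 4.0 and on newer versions, so the column selection never silently collapses to zero columns.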