sl3
sl3 copied to clipboard
Delayed learner fit does not yield same result as regular fit when `Lrnr_randomForest` is used
When Lrnr_randomForest
is a candidate learner in a super learner library, the delayed learner fit and regular fit do not yield the same prediction result.
Here is an example borrowed from test-delayed_sl3.R
:
library(delayed)
library(SuperLearner)
library(future)
plan(sequential)
data(cpp_imputed)
task <- sl3_Task$new(
cpp_imputed,
covariates = c("apgar1", "apgar5", "parity", "gagebrth", "mage", "meducyrs", "sexn"),
outcome = "haz")
lrnr_rf <- Lrnr_randomForest$new()
lrnr_glmnet <- Lrnr_glmnet$new()
lrnr_glm_fast <- Lrnr_glm_fast$new()
stack <- Stack$new(lrnr_rf, lrnr_glmnet, lrnr_glm_fast)
sl <- Lrnr_sl$new(learners = stack)
set.seed(123)
test_delayed <- delayed_learner_train(sl, task)
sched <- Scheduler$new(test_delayed, SequentialJob)
set.seed(123)
fit_delayed <- sched$compute()
preds_delayed <- fit_delayed$predict()
set.seed(123)
fit <- sl$train(task)
preds <- fit$predict()
expect_equal(as.numeric(preds_delayed), as.numeric(preds))
Running the code above, preds_delayed
and preds
do not match.
Note that if we remove Lrnr_randomForest
, result is reproducible:
plan(sequential)
data(cpp_imputed)
task <- sl3_Task$new(
cpp_imputed,
covariates = c("apgar1", "apgar5", "parity", "gagebrth", "mage", "meducyrs", "sexn"),
outcome = "haz")
lrnr_glmnet <- Lrnr_glmnet$new()
lrnr_glm_fast <- Lrnr_glm_fast$new()
stack <- Stack$new(lrnr_glmnet, lrnr_glm_fast)
sl <- Lrnr_sl$new(learners = stack)
set.seed(123)
test_delayed <- delayed_learner_train(sl, task)
sched <- Scheduler$new(test_delayed, SequentialJob)
set.seed(123)
fit_delayed <- sched$compute()
set.seed(123)
fit <- sl$train(task)
preds_delayed <- fit_delayed$predict()
preds <- fit$predict()
expect_equal(preds_delayed, preds)
Removing set.seed(123)
above the line test_delayed <- delayed_learner_train(sl, task)
, preds_delayed
and preds
do not match.
So calling lrnr$train
on any learner is going to be running delayed anyways. See: https://github.com/tlverse/sl3/blob/master/R/Lrnr_base.R#L224
The difference between the two runs is that by default we use FutureJob
, not SequentialJob
, which is controllable by this function: sl3:::sl3_delayed_job_type
. Changing the example to be consistent about job type fixes the issue:
library(delayed)
library(future)
library(sl3)
library(testthat)
plan(sequential)
data(cpp_imputed)
task <- sl3_Task$new(
cpp_imputed,
covariates = c("apgar1", "apgar5", "parity", "gagebrth", "mage", "meducyrs", "sexn"),
outcome = "haz")
lrnr_rf <- Lrnr_randomForest$new()
# FutureJob is reproducible
set.seed(123)
test_delayed <- delayed_learner_train(lrnr_rf, task)
sched <- Scheduler$new(test_delayed, FutureJob)
fit_delayed <- sched$compute()
preds_delayed <- fit_delayed$predict()
set.seed(123)
options(sl3.enable.future = TRUE) # the default
fit <- lrnr_rf$train(task)
preds <- fit$predict()
expect_equal(as.numeric(preds_delayed), as.numeric(preds))
# SequentialJob is reproducible
set.seed(123)
test_delayed <- delayed_learner_train(lrnr_rf, task)
sched <- Scheduler$new(test_delayed, SequentialJob)
fit_delayed <- sched$compute()
preds_delayed <- fit_delayed$predict()
set.seed(123)
options(sl3.enable.future = FALSE)
fit <- lrnr_rf$train(task)
preds <- fit$predict()
expect_equal(as.numeric(preds_delayed), as.numeric(preds))
The two job types currently handle seeds differently, which may be fixable, but I believe we changed FutureJob to fix other reproducibility issues, especially under parallelization. Maybe an issue on delayed is appropriate