sl3 icon indicating copy to clipboard operation
sl3 copied to clipboard

Delayed learner fit does not yield same result as regular fit when `Lrnr_randomForest` is used

Open tq21 opened this issue 2 years ago • 1 comments

When Lrnr_randomForest is a candidate learner in a super learner library, the delayed learner fit and regular fit do not yield the same prediction result.

Here is an example borrowed from test-delayed_sl3.R:

library(delayed)
library(SuperLearner)
library(future)

plan(sequential)
  
data(cpp_imputed)
task <- sl3_Task$new(
  cpp_imputed, 
  covariates = c("apgar1", "apgar5", "parity", "gagebrth", "mage", "meducyrs", "sexn"),
  outcome = "haz")
  
lrnr_rf <- Lrnr_randomForest$new()
lrnr_glmnet <- Lrnr_glmnet$new()
lrnr_glm_fast <- Lrnr_glm_fast$new()
  
stack <- Stack$new(lrnr_rf, lrnr_glmnet, lrnr_glm_fast)
sl <- Lrnr_sl$new(learners = stack)

set.seed(123)
test_delayed <- delayed_learner_train(sl, task)
sched <- Scheduler$new(test_delayed, SequentialJob)
set.seed(123)
fit_delayed <- sched$compute()
preds_delayed <- fit_delayed$predict()

set.seed(123)
fit <- sl$train(task)
preds <- fit$predict()
expect_equal(as.numeric(preds_delayed), as.numeric(preds))

Running the code above, preds_delayed and preds do not match.

Note that if we remove Lrnr_randomForest, result is reproducible:

plan(sequential)

data(cpp_imputed)
task <- sl3_Task$new(
  cpp_imputed, 
  covariates = c("apgar1", "apgar5", "parity", "gagebrth", "mage", "meducyrs", "sexn"),
  outcome = "haz")

lrnr_glmnet <- Lrnr_glmnet$new()
lrnr_glm_fast <- Lrnr_glm_fast$new()

stack <- Stack$new(lrnr_glmnet, lrnr_glm_fast)
sl <- Lrnr_sl$new(learners = stack)

set.seed(123)
test_delayed <- delayed_learner_train(sl, task)
sched <- Scheduler$new(test_delayed, SequentialJob)
set.seed(123)
fit_delayed <- sched$compute()

set.seed(123)
fit <- sl$train(task)

preds_delayed <- fit_delayed$predict()
preds <- fit$predict()

expect_equal(preds_delayed, preds)

Removing set.seed(123) above the line test_delayed <- delayed_learner_train(sl, task), preds_delayed and preds do not match.

tq21 avatar Nov 16 '22 00:11 tq21

So calling lrnr$train on any learner is going to be running delayed anyways. See: https://github.com/tlverse/sl3/blob/master/R/Lrnr_base.R#L224

The difference between the two runs is that by default we use FutureJob, not SequentialJob, which is controllable by this function: sl3:::sl3_delayed_job_type. Changing the example to be consistent about job type fixes the issue:

library(delayed)
library(future)
library(sl3)
library(testthat)
plan(sequential)

data(cpp_imputed)
task <- sl3_Task$new(
  cpp_imputed, 
  covariates = c("apgar1", "apgar5", "parity", "gagebrth", "mage", "meducyrs", "sexn"),
  outcome = "haz")

lrnr_rf <- Lrnr_randomForest$new()

# FutureJob is reproducible
set.seed(123)
test_delayed <- delayed_learner_train(lrnr_rf, task)
sched <- Scheduler$new(test_delayed, FutureJob)
fit_delayed <- sched$compute()
preds_delayed <- fit_delayed$predict()

set.seed(123)
options(sl3.enable.future = TRUE) # the default
fit <- lrnr_rf$train(task)
preds <- fit$predict()
expect_equal(as.numeric(preds_delayed), as.numeric(preds))

# SequentialJob is reproducible
set.seed(123)
test_delayed <- delayed_learner_train(lrnr_rf, task)
sched <- Scheduler$new(test_delayed, SequentialJob)
fit_delayed <- sched$compute()
preds_delayed <- fit_delayed$predict()

set.seed(123)
options(sl3.enable.future = FALSE)
fit <- lrnr_rf$train(task)
preds <- fit$predict()
expect_equal(as.numeric(preds_delayed), as.numeric(preds))

The two job types currently handle seeds differently, which may be fixable, but I believe we changed FutureJob to fix other reproducibility issues, especially under parallelization. Maybe an issue on delayed is appropriate

jeremyrcoyle avatar Nov 16 '22 16:11 jeremyrcoyle