mlr3

Bootstrap resampling in combination with certain data transforms causes error

Open rjstretch opened this issue 2 years ago • 6 comments

I encountered an almost identical issue to that raised in this StackOverflow post.

Reproducible example:

library("mlr3verse")
library("mlr3oml")

# Task with missing data
task_oml <- OMLTask$new(54)
task <- task_oml$task
sum(is.na(task$data()))

# Learner / search space
learner = lrn("classif.ranger", id = "rf", predict_type = "prob")
search_space = ps(rf.sample.fraction = p_dbl(0.5, 1))

# Graph using pipeline_robustify to impute missing values
graph <-
	mlr3pipelines::pipeline_robustify(task, learner = learner, impute_missings = NULL) %>>%
	learner
graph_learner = as_learner(graph)

outer = rsmp("repeated_cv", folds = 5, repeats = 1)
# NOTE: Code works fine if the line below is changed to: inner = rsmp("cv", folds = 5)
inner = rsmp("bootstrap", repeats = 10)

rr = tune_nested(
	task = task,
	learner = graph_learner,
	inner_resampling = inner,
	outer_resampling = outer,
	search_space = search_space,
	term_evals = 10L,
	method = "random_search"
)

This results in the following error:

Error in as_data_backend.data.frame(data, primary_key = row_ids) :
  Assertion on 'primary_key' failed: Contains duplicated values, position 5.
This happened PipeOp imputehist's $train()

There is another reproducible example in the StackOverflow post.

The error goes away if I use k-fold CV instead of bootstrap for the inner resampling.
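The difference between the two inner resamplings can be illustrated without mlr3 at all. The following is a minimal base R sketch (the variable names are illustrative and not taken from mlr3 internals): bootstrap draws row ids with replacement, so some ids repeat, whereas k-fold CV partitions the ids, so each training set contains every id at most once.

```r
set.seed(1)
n <- 150L
row_ids <- seq_len(n)

# Bootstrap: draw n row ids with replacement, so some ids are drawn repeatedly.
boot_train <- sample(row_ids, size = n, replace = TRUE)

# 5-fold CV: assign each row id to exactly one fold; the training set for
# fold 1 is every id that is not in fold 1.
fold <- sample(rep_len(1:5, n))
cv_train <- row_ids[fold != 1L]

anyDuplicated(boot_train) > 0L  # TRUE: at least one id appears more than once
anyDuplicated(cv_train) > 0L    # FALSE: all ids are unique
```

Any step that later treats these ids as a unique key would therefore accept `cv_train` but reject `boot_train`.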

rjstretch, Jun 28 '22 13:06

@mb706 @be-marc @sebffischer

mllg, Oct 19 '22 11:10

Isn't that a pipelines issue?

berndbischl, Oct 19 '22 12:10

mlr3pipelines makes the assumption that the following works:

task <- tsk("iris")
task$clone(deep = TRUE)$select(character(0))$cbind(task$data())

i.e. that features can be overwritten with other features by cbinding them. This example is a bit contrived, but we could just as well select a few features and cbind other features. I think this is a reasonable assumption, so I don't think mlr3pipelines is broken here.

The problem is that this operation fails once a bootstrap resampling has been applied. This is done here; a minimal example:

task <- tsk("iris")
rs <- rsmp("bootstrap", repeats = 1)$instantiate(task)
task$row_roles$use = rs$train_set(1)

# -- the operation from above fails now

task$clone(deep = TRUE)$select(character(0))$cbind(task$data())
#> Error in as_data_backend.data.frame(data, primary_key = row_ids) : 
#>  Assertion on 'primary_key' failed: Contains duplicated values, position 3.

It gives an error because $cbind() doesn't work with duplicate $use row roles. At some point we said that row roles should not be duplicated; in that case, the resample() code should behave differently when given a bootstrap resampling. We could also have a longer discussion about how to fix DataBackendCbind.
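If the error does come from duplicated row ids, one conceivable stopgap is to swap the bootstrap for a resampling that draws without replacement. The sketch below uses `rsmp("subsampling")` from mlr3 for this. The ratio of 0.632 is an assumption, chosen to roughly match the share of distinct rows a bootstrap sample contains, and the sketch has not been run against the original reprex. Note that it changes the resampling scheme rather than fixing `$cbind()`.

```r
library(mlr3)

task <- tsk("iris")

# Subsampling draws the training rows without replacement, so no row id repeats.
# ratio = 0.632 is an assumed value, not something prescribed by mlr3.
rs <- rsmp("subsampling", ratio = 0.632, repeats = 1)$instantiate(task)
train_ids <- rs$train_set(1)
anyDuplicated(train_ids) == 0L

# Restrict the task to the sampled rows. Because the ids are unique, the
# cbind operation from above is expected to go through.
task$row_roles$use <- train_ids
task$clone(deep = TRUE)$select(character(0))$cbind(task$data())
```

In the nested tuning setup this would correspond to passing a subsampling object as the inner resampling instead of the bootstrap one.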

mb706, Oct 19 '22 14:10

Just pitching in that I faced this issue on one of my pipelines today as well. Is there a workaround I could use?

Interestingly, I only get the error at the end of the 5th iteration.

INFO  [17:49:27.673] [mlr3] Running benchmark with 8 resampling iterations

### 3 FOLD CV HERE, WORKS FINE ---
INFO  [17:49:27.740] [mlr3] Applying learner 'pca.encodeimpact.balanced_sampling.classif.xgboost' on task 'all_factors' (iter 1/3)
INFO  [17:49:48.393] [mlr3] Applying learner 'pca.encodeimpact.balanced_sampling.classif.xgboost' on task 'all_factors' (iter 2/3)
INFO  [17:50:07.023] [mlr3] Applying learner 'pca.encodeimpact.balanced_sampling.classif.xgboost' on task 'all_factors' (iter 3/3)

### 5 REPEAT BOOTSTRAP HERE, ITERATIONS WORK FINE, THEN FAIL(?) ---
INFO  [17:50:25.196] [mlr3] Applying learner 'pca.encodeimpact.balanced_sampling.classif.xgboost' on task 'all_factors' (iter 1/5)
INFO  [17:50:25.963] [mlr3] Applying learner 'pca.encodeimpact.balanced_sampling.classif.xgboost' on task 'all_factors' (iter 2/5)
INFO  [17:50:26.735] [mlr3] Applying learner 'pca.encodeimpact.balanced_sampling.classif.xgboost' on task 'all_factors' (iter 3/5)
INFO  [17:50:27.499] [mlr3] Applying learner 'pca.encodeimpact.balanced_sampling.classif.xgboost' on task 'all_factors' (iter 4/5)
INFO  [17:50:28.239] [mlr3] Applying learner 'pca.encodeimpact.balanced_sampling.classif.xgboost' on task 'all_factors' (iter 5/5)

Error:
Assertion on 'primary_key' failed: Contains duplicated values, position 2.
This happened PipeOp pca's $train()

I don't have a reprex for you at the moment, but the pipeline is:

po("pca",
    affect_columns = selector_grep("temp"))  %>>%
    po("encodeimpact", affect_columns = selector_type("factor")) %>>%
    po("classbalancing",
        id = "balanced_sampling",
        adjust = "all",
        reference = "all",
        ratio = 1
    )

rsangole, Oct 28 '22 18:10

@rsangole What value for ratio are you using for your bootstrap, and what is the sample size of your task?

mb706, Oct 31 '22 04:10

The default ratio value, i.e. rsmp("bootstrap", repeats = 5). The number of rows is ~1.8 million.
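With the default ratio, a bootstrap sample draws as many rows as the task has, with replacement. A back-of-the-envelope sketch in base R (n is the approximate row count quoted above; nothing here calls mlr3) suggests that duplicated row ids are unavoidable at this size:

```r
n <- 1.8e6  # approximate number of rows, as stated above

# Expected fraction of distinct rows in a bootstrap sample of size n:
# 1 - (1 - 1/n)^n, which approaches 1 - exp(-1) for large n.
expected_unique <- 1 - (1 - 1 / n)^n

# Log-probability that a sample of size n contains no duplicate at all:
# log(n! / n^n), computed on the log scale to avoid overflow.
log_p_no_dup <- lgamma(n + 1) - n * log(n)

expected_unique  # about 0.632
log_p_no_dup     # hugely negative, i.e. the probability is effectively zero
```

So roughly a third of the rows in each bootstrap training set are repeats of rows already drawn.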

rsangole, Oct 31 '22 04:10