mlr3pipelines Assertion on 'primary_key' failed: Contains duplicated values

Open wangbaili opened this issue 2 years ago • 6 comments

Hello!

Thank you again for the R implementation of mlr3.

I want to apply the encode and scale PipeOps to a survival model (deepsurv), but I have run into some trouble. This is my code:

library(readxl)
library(mlr3)
library(mlr3benchmark)
library(mlr3cluster)
library(mlr3data)
library(mlr3filters)
library(mlr3fselect)
library(mlr3learners)
library(mlr3measures)
library(mlr3pipelines)
library(mlr3proba)
library(mlr3tuningspaces)
library(mlr3viz)
library(mlr3extralearners)
library(tableone)


es5 <- read_excel("es5.xlsx")
# Convert columns 7 to 19 to factors
es5[, 7:19] <- lapply(es5[, 7:19], function(x) as.factor(as.character(x)))
task5 <- TaskSurv$new("task5", es5, time = "time5", event = "status5")

# Treatment-encode factor columns, then scale numeric columns
coder <- po("encode", method = "treatment", affect_columns = selector_type("factor"))
scaler <- po("scale", affect_columns = selector_type("numeric"))
learner_po <- po("learner", lrn("surv.deepsurv",
  early_stopping = FALSE, optimizer = "adam",
  dropout = 0.13866, learning_rate = 0.3871, alpha = 0.160,
  num_nodes = c(169L, 169L, 169L, 169L, 169L, 169L, 169L, 169L)))

graph <- coder %>>% scaler %>>% learner_po

deepsurv5ln <- as_learner(graph)
resampling5 <- rsmp("bootstrap", ratio = 0.7, repeats = 3)
design <- benchmark_grid(task5, deepsurv5ln, resampling5)
bm <- benchmark(design)

When I ran bm, I got this error:

Error in as_data_backend.data.frame(data, primary_key = row_ids) : 
  Assertion on 'primary_key' failed: Contains duplicated values, position 2.
This happened PipeOp encode's $train()

I don't understand this.

Thanks again as I await your suggestion.

Feb 13 '22 14:02 wangbaili

Hi, could you provide us with the column names of your es5 dataset?

What would help even more would be a minimal reproducible code example that we can actually run, i.e. including all the data that is being used.

Mar 07 '22 12:03 mb706

My assumption is that the "encode" PipeOp creates a column that is named ..row_id, which confuses mlr3 since it is in some way a reserved column name.

Mar 07 '22 12:03 mb706

Sorry for the late reply; I had to deal with some health problems. The data is confidential, but I get the same encode problem with this data: aa1.xlsx. All columns in this data are factors except event = "status" and time = "time". Thanks again as I await your suggestion.

Mar 18 '22 16:03 wangbaili

This is the code:

aa <- read_excel("C:/Users/LENOVO/Desktop/aa/aa1.xlsx")
names(aa)

aa[, 3:13] <- lapply(aa[, 3:13], function(x) as.factor(as.character(x)))
taskwork <- TaskSurv$new("taskwork", aa, time = "time", event = "status")
learners <- lrns(
  paste0("surv.", c("coxtime", "deephit", "deepsurv", "loghaz", "pchazard")),
  frac = 0.3, early_stopping = TRUE, epochs = 10, optimizer = "adam"
)
create_pipeops <- function(learner) {
  po("encode", method = "treatment") %>>% po("learner", learner)
}
learners <- lapply(learners, create_pipeops)

resampling <- rsmp("bootstrap", ratio = 0.6, repeats = 10)
design <- benchmark_grid(taskwork, learners, resampling)
bm <- benchmark(design)

Mar 18 '22 17:03 wangbaili

This is the error:

Error in as_data_backend.data.frame(data, primary_key = row_ids) :
  Assertion on 'primary_key' failed: Contains duplicated values, position 2.
This happened PipeOp encode's $train()

Mar 18 '22 17:03 wangbaili

Thanks! Apparently the problem is that bootstrapping uses some rows repeatedly, which somehow breaks with mlr3's assumption that row_ids are unique values.

Minimal example:

library("mlr3")
library("mlr3pipelines")
options(mlr3.debug=TRUE)
resample(tsk("iris"), po("pca") %>>% lrn("classif.featureless"), rsmp("bootstrap"))
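The repeated rows can be seen directly by inspecting a bootstrap training set. The following is only a rough sketch on the built-in iris task; the variable names (task, bootstrap, train_ids, has_duplicates) are illustrative and not taken from the code above:

```r
library("mlr3")

task <- tsk("iris")

# Bootstrap draws row ids with replacement, so ids can occur more than once
bootstrap <- rsmp("bootstrap", repeats = 1, ratio = 1)
bootstrap$instantiate(task)

train_ids <- bootstrap$train_set(1)

# anyDuplicated() returns the index of the first repeated id, or 0 if none
has_duplicates <- anyDuplicated(train_ids) > 0
print(has_duplicates)
```

These duplicated row ids are what the 'primary_key' assertion complains about.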

I will try to take care of this soon, until then a workaround would be to use a different resampling method (e.g. rsmp("cv") instead of rsmp("bootstrap")).
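A minimal sketch of that workaround, applied to the same iris example rather than the survival data (the names graph_learner and rr, and the choice of 3 folds, are assumptions made for illustration):

```r
library("mlr3")
library("mlr3pipelines")

# Same graph as in the minimal example, wrapped as a learner
graph_learner <- as_learner(po("pca") %>>% lrn("classif.featureless"))

# Cross-validation assigns each row to exactly one fold,
# so no row id is repeated within a training set
rr <- resample(tsk("iris"), graph_learner, rsmp("cv", folds = 3))
print(rr$aggregate(msr("classif.ce")))
```

In the benchmark code from the earlier comments, this would presumably amount to swapping the rsmp("bootstrap", ...) call for rsmp("cv", ...) and leaving the rest unchanged.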

Apr 27 '22 21:04 mb706
