mlr3pipelines
Assertion on 'primary_key' failed: Contains duplicated values
Hello!
Thank you again for the R implementation of mlr3.
I want to apply the encode and scale PipeOps to a survival model (deepsurv), but I am running into trouble. This is my code:
library(readxl)
library(mlr3)
library(mlr3benchmark)
library(mlr3cluster)
library(mlr3data)
library(mlr3filters)
library(mlr3fselect)
library(mlr3learners)
library(mlr3measures)
library(mlr3pipelines)
library(mlr3proba)
library(mlr3tuningspaces)
library(mlr3viz)
library(mlr3extralearners)
library(tableone)
es5 <- read_excel("es5.xlsx")
# convert columns 7 to 19 to factors
es5[, 7:19] <- lapply(es5[, 7:19], function(x) as.factor(as.character(x)))
task5 <- TaskSurv$new("task5", es5, time = "time5", event = "status5")
resampling5 <- rsmp("bootstrap", ratio = 0.7, repeats = 3)
coder <- po("encode", method = "treatment", affect_columns = selector_type("factor"))
scaler <- po("scale", affect_columns = selector_type("numeric"))
# eight hidden layers of 169 nodes each
learner_po <- po("learner", lrn("surv.deepsurv",
  early_stopping = FALSE, optimizer = "adam", dropout = 0.13866,
  learning_rate = 0.3871, alpha = 0.160, num_nodes = rep(169L, 8)))
graph <- coder %>>% scaler %>>% learner_po
deepsurv5ln <- as_learner(graph)
design <- benchmark_grid(task5, deepsurv5ln, resampling5)
bm <- benchmark(design)
When I run the benchmark, I get this error:
Error in as_data_backend.data.frame(data, primary_key = row_ids) :
Assertion on 'primary_key' failed: Contains duplicated values, position 2.
This happened PipeOp encode's $train()
I don't understand this.
Thanks again as I await your suggestion.
Hi, could you provide us with the column names of your es5 dataset?
What would help even more would be a minimal reproducible code example that we can actually run, i.e. including all the data that is being used.
My assumption is that the "encode" PipeOp creates a column that is named ..row_id, which confuses mlr3 since it is in some way a reserved column name.
Sorry for the long delay in replying; I had to deal with some health problems. The data is confidential, but I get the same encode problem with this data: aa1.xlsx. All columns in this data are factors except the event and time columns (event = "status", time = "time"). Thanks again as I await your suggestion.
This is the code:
aa <- read_excel("C:/Users/LENOVO/Desktop/aa/aa1.xlsx")
names(aa)
aa[, 3:13] <- lapply(aa[, 3:13], function(x) as.factor(as.character(x)))
taskwork <- TaskSurv$new("taskwork", aa, time = "time", event = "status")
learners <- lrns(
  paste0("surv.", c("coxtime", "deephit", "deepsurv", "loghaz", "pchazard")),
  frac = 0.3, early_stopping = TRUE, epochs = 10, optimizer = "adam"
)
create_pipeops <- function(learner) {
  po("encode", method = "treatment") %>>% po("learner", learner)
}
learners <- lapply(learners, create_pipeops)
resampling <- rsmp("bootstrap", ratio = 0.6, repeats = 10)
design <- benchmark_grid(taskwork, learners, resampling)
bm <- benchmark(design)
This is the error:
Error in as_data_backend.data.frame(data, primary_key = row_ids) :
Assertion on 'primary_key' failed: Contains duplicated values, position 2.
This happened PipeOp encode's $train()
Thanks! Apparently the problem is that bootstrapping uses some rows repeatedly, which somehow breaks with mlr3's assumption that row_ids are unique values.
Minimal example:
library("mlr3")
library("mlr3pipelines")
options(mlr3.debug=TRUE)
resample(tsk("iris"), po("pca") %>>% lrn("classif.featureless"), rsmp("bootstrap"))
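To illustrate the point about repeated rows, here is a small sketch that only inspects the bootstrap training set and does not involve any PipeOp. The choice of `repeats = 1`, `ratio = 1` and the seed are assumptions made for the illustration, not settings from the reports above.

```r
library("mlr3")

task <- tsk("iris")

# One bootstrap repetition that draws as many rows as the task has (with replacement)
resampling <- rsmp("bootstrap", repeats = 1, ratio = 1)
set.seed(1)  # arbitrary seed, only for reproducibility of the sketch
resampling$instantiate(task)

train_ids <- resampling$train_set(1)

# Row ids that occur more than once in the training set
dup_ids <- unique(train_ids[duplicated(train_ids)])
length(train_ids)   # number of drawn rows
length(dup_ids) > 0 # almost surely TRUE when drawing 150 of 150 rows with replacement
```

These repeated ids are what ends up as `row_ids` in the failing `as_data_backend()` call shown in the error message.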
I will try to take care of this soon; until then, a workaround would be to use a different resampling method (e.g. rsmp("cv") instead of rsmp("bootstrap")).
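As a rough sketch of that workaround, the minimal example can be run with cross-validation in place of bootstrapping. The number of folds and the `classif.ce` measure are assumptions chosen for illustration; for the survival pipelines above, only the `rsmp()` call would need to change.

```r
library("mlr3")
library("mlr3pipelines")

# Same pipeline as in the minimal example, wrapped as a learner
graph_learner <- as_learner(po("pca") %>>% lrn("classif.featureless"))

# Cross-validation never puts the same row into a training set twice
resampling <- rsmp("cv", folds = 3)

rr <- resample(tsk("iris"), graph_learner, resampling)
rr$aggregate(msr("classif.ce"))
```

Note that cross-validation and bootstrapping estimate performance differently, so results obtained this way are not directly comparable to bootstrap estimates.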