mlr3pipelines

Assertion on 'primary_key' failed: Contains duplicated values

Open wangbaili opened this issue 2 years ago • 6 comments

Hello!

Thank you again for the R implementation of mlr3.

I want to apply the encode and scale PipeOps (via po()) before a survival model (deepsurv), but I have run into some trouble. This is my code:

library(readxl)
library(mlr3)
library(mlr3benchmark)
library(mlr3cluster)
library(mlr3data)
library(mlr3filters)
library(mlr3fselect)
library(mlr3learners)
library(mlr3measures)
library(mlr3pipelines)
library(mlr3proba)
library(mlr3tuningspaces)
library(mlr3viz)
library(mlr3extralearners)
library(tableone)


es5<- read_excel("es5.xlsx")
es5[,7:19]<-lapply(es5[,7:19],function(x)as.factor(as.character(x)))
task5<-TaskSurv$new("task5",es5, time = "time5", event = "status5")
resampling5 <- rsmp("bootstrap", ratio=0.7,repeats=3)

coder=po("encode", method = "treatment", affect_columns = selector_type("factor"))
scaler=po("scale",affect_columns = selector_type("numeric"))
learner_po = po("learner", lrn("surv.deepsurv", early_stopping =F,  optimizer = "adam",dropout=0.13866,learning_rate=0.3871,	alpha=0.160,num_nodes = c(169L, 169L,169L, 169L,169L, 169L,169L, 169L)))

graph=coder%>>%scaler%>>%learner_po

deepsurv5ln<- as_learner(graph)
resampling5 <- rsmp("bootstrap", ratio=0.7,repeats=3)
design <- benchmark_grid(task5, deepsurv5ln, resampling5)
bm <- benchmark(design)

When I ran the benchmark, I got this error:

Error in as_data_backend.data.frame(data, primary_key = row_ids) : 
  Assertion on 'primary_key' failed: Contains duplicated values, position 2.
This happened PipeOp encode's $train()

I don't understand this.

Thanks again; I await your suggestions.

wangbaili, Feb 13 '22 14:02

Hi, could you provide us with the column names of your es5 dataset?

What would help even more would be a minimal reproducible code example that we can actually run, i.e. including all the data that is being used.

mb706, Mar 07 '22 12:03

My assumption is that the "encode" PipeOp creates a column that is named ..row_id, which confuses mlr3 since it is in some way a reserved column name.

mb706, Mar 07 '22 12:03

Sorry for the late reply; I had to deal with some health problems. The data is confidential, but I get the same encode problem with this data: aa1.xlsx. All columns in this data are factors except the event and time columns (event = "status", time = "time"). Thanks again; I await your suggestions.

wangbaili, Mar 18 '22 16:03

This is the code:

aa <- read_excel("C:/Users/LENOVO/Desktop/aa/aa1.xlsx")
names(aa)

aa[, 3:13] <- lapply(aa[, 3:13], function(x) as.factor(as.character(x)))
taskwork <- TaskSurv$new("taskwork", aa, time = "time", event = "status")
learners <- lrns(
  paste0("surv.", c("coxtime", "deephit", "deepsurv", "loghaz", "pchazard")),
  frac = 0.3, early_stopping = TRUE, epochs = 10, optimizer = "adam"
)
create_pipeops <- function(learner) {
  po("encode", method = "treatment") %>>% po("learner", learner)
}
learners <- lapply(learners, create_pipeops)

resampling <- rsmp("bootstrap", ratio = 0.6, repeats = 10)
design <- benchmark_grid(taskwork, learners, resampling)
bm <- benchmark(design)

wangbaili, Mar 18 '22 17:03

This is the error:

Error in as_data_backend.data.frame(data, primary_key = row_ids) : 
  Assertion on 'primary_key' failed: Contains duplicated values, position 2.
This happened PipeOp encode's $train()

wangbaili, Mar 18 '22 17:03

Thanks! Apparently the problem is that bootstrapping uses some rows repeatedly, which somehow breaks with mlr3's assumption that row_ids are unique values.
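To see the repeated rows directly, here is a small sketch that only inspects the resampling object and does not involve any PipeOp. It assumes the built-in iris task as a stand-in for the confidential data; the seed and the `repeats`/`ratio` values are arbitrary choices for illustration.

```r
library(mlr3)

set.seed(1)  # arbitrary seed, only so the draw is repeatable

task <- tsk("iris")  # stand-in task, not the data from this issue

# Bootstrap draws the training rows with replacement
resampling <- rsmp("bootstrap", repeats = 1, ratio = 1)
resampling$instantiate(task)

train_ids <- resampling$train_set(1)

# TRUE when at least one row id occurs more than once in the training set
has_duplicates <- anyDuplicated(train_ids) > 0
print(has_duplicates)

# Number of distinct rows versus number of drawn rows
print(c(distinct = length(unique(train_ids)), drawn = length(train_ids)))
```

Those duplicated ids are what the 'primary_key' assertion in the error message complains about once a PipeOp builds a new data backend from the training rows.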

Minimal example:

library("mlr3")
library("mlr3pipelines")
options(mlr3.debug=TRUE)
resample(tsk("iris"), po("pca") %>>% lrn("classif.featureless"), rsmp("bootstrap"))

I will try to take care of this soon, until then a workaround would be to use a different resampling method (e.g. rsmp("cv") instead of rsmp("bootstrap")).
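A minimal sketch of that workaround, reusing the toy pipeline from the minimal example above. The wrapping in `as_learner()`, the number of folds, and the `classif.ce` measure are illustrative choices. The subsampling variant is an assumption on my side rather than something suggested in this thread: it draws rows without replacement, so it should avoid duplicated row ids while keeping a `ratio`/`repeats` setup similar to the original bootstrap call.

```r
library(mlr3)
library(mlr3pipelines)

set.seed(1)  # arbitrary seed for repeatable splits

# Same toy pipeline as in the minimal example, wrapped as a learner
graph_learner <- as_learner(po("pca") %>>% lrn("classif.featureless"))

# Workaround from this thread: cross-validation instead of bootstrap
rr_cv <- resample(tsk("iris"), graph_learner, rsmp("cv", folds = 3))
print(rr_cv$aggregate(msr("classif.ce")))

# Assumed alternative: subsampling draws rows without replacement
rr_sub <- resample(
  tsk("iris"), graph_learner,
  rsmp("subsampling", ratio = 0.7, repeats = 3)
)
print(rr_sub$aggregate(msr("classif.ce")))
```

For the survival benchmark in this issue, the same swap would presumably go into the `rsmp(...)` call that is passed to `benchmark_grid()`.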

mb706, Apr 27 '22 21:04