mlr3pipelines icon indicating copy to clipboard operation
mlr3pipelines copied to clipboard

PipeOpFeatureUnion should use DataBackend cbind

Open mb706 opened this issue 4 years ago • 2 comments

PipeOpFeatureUnion currently uses Task$data() to get the data to cbind. We can save some space and performance if we just cbind the DataBackend directly.

Difficulty: The DataBackends do not contain any information about col roles (in particular which columns removed from the data), so this has to be done after cbind intelligently.

mb706 avatar Apr 02 '20 16:04 mb706

So how much of improvement can we expect here. I recently changed the cbind_tasks method of PipeOpFeatureUnion to rely on modifications by reference via data.table where possible (so there are only the $data() calls in line 179 left which could be bad memorywise/speedwise). However, on my machine the current changes #358 already yield drastic improvements for "larger" data:

dat1 = matrix(rnorm(1000 * 100000), nrow = 100000, ncol = 1000)
colnames(dat1) = c("target", paste0("feature", 1:999))
dat1 = as.data.table(dat1)
task1 = TaskRegr$new("test1", backend = dat1, target = "target")

dat2 = cbind(dat1[, 1], matrix(rnorm(999 * 100000), nrow = 100000, ncol = 999))
colnames(dat2) = c("target", paste0("feature", 1000:1998))
dat2 = as.data.table(dat2)
task2 = TaskRegr$new("test2", backend = dat2, target = "target")

po = PipeOpFeatureUnion$new()
microbenchmark::microbenchmark(po$train(list(task1, task2)), times = 10L)
#!!!Unit: milliseconds
#                         expr      min       lq     mean   median      uq
# po$train(list(task1, task2)) 452.4882 543.9716 640.1103 601.5686 685.713
#      max neval
# 1046.604    10

vs. current master

#!!!Unit: seconds
#                         expr      min       lq     mean  median       uq
# po$train(list(task1, task2)) 191.6633 212.4702 215.7789 215.179 218.7174
#      max neval
# 243.1069    10

which is a median improvement of around the factor ~ 350.

sumny avatar Apr 03 '20 12:04 sumny

It would still be useful where we are using non-data.table backends. But it would also be a bit difficult, so this has low priority for now.

mb706 avatar Apr 12 '20 14:04 mb706