mlr3pipelines
mlr3pipelines copied to clipboard
predict on test set rows
go through each pipeop and let it predict directly, since some pipeops need the predicted set
- PipeOpTaskPreproc checks if there are test set rows, if so calls .predict_task
- tests should verify that all pipeops that have
"Task"inputs and outputs handle this correctly:$predict()and$train()[test set rows, ]are the same. The autotests have some flag that indicates if$predict()is random? inexpect_datapreproc_pipeop_class - some pipeops that do task preprocessing but dont inherit from pipeoptaskpreproc are worrying here: pipeopimpute. It should probably just inherit from pipeoptaskpreproc.
- PipeOpTaskPreproc always handles test rows if they are present. GraphLearner removes test rows in most cases, unless a PipeOp needs them. Determines this by checking each PipeOp: Does it contain a
$uses_test_rowsactive binding / field? (choose not to just look for attached$learner, if present. Problem here: e.g. ImputeLearner should not forward its 'needs-test-set' since it can not be used) - GraphLearner can set its own relevant property
There is a memory inefficiency with the suggested approach:
Let's say we have a task with data of the form
| row_id | y | x1 |
|---|---|---|
| 1 | ||
| ... | ||
| n_use | ||
| n_use + 1 | ||
| ... | ||
| n_use + n_test |
Lets say we apply two preprocessing operations to x1 in two subsequent PipeOps.
In the approach we discussed, we first task$cbind() the preprocessed train rows (pipeop$.train_task()and then task$rbind() the preprocessed test rows, resulting from pipeop$.predict_task().
The problem with this is now, that the task$rbind() of the test rows, happens in both preprocessing PipeOps.
This means in both PipeOps, a table of the form below is rbinded.
| row_id | y | x1 |
|---|---|---|
| 1 | ||
| ... | ||
| n_test |
The wasteful thing here is that we rbind column y, as it contains information that is simply already present in the tasks's backend and is now being duplicated. This problem comes worse when there are many columns that are left untouched by a preprocessing pipeop.
The virtual backend that would result of applying two preprocessing operations on x1 would internally store n_use + n_test + n_test * n_pipeop rows, (containing all un-preprocessed columns + newly created columns) which is not so great.
Fortunately there is a more memory efficient way to achieve the same results, which works (at least) for the standard scenario of having distinct and unique use rows and test rows (which is usually the case, except for bootstrapping and insample resampling). That approach requires the following:
- we modify
mlr3'stask$cbind()function, to support$cbind()-ing test rows (straightforward) - Preprocessing
PipeOps must (I think) call$.train_dt()and$.predict_dt(), then rbind their result and cbind them to the task. (not so straightforward, because we are changing the structure ofPipeOpTaskPreproc)
In addition to that, retrieving the use roles will become more and more expensive, because of the many DataBackendRbinds. This is, because calling databackendrbind$data(), first goes through the rbinded backend (b2) to find the rows and then through b1.
Some possible solutions:
- Accept this inefficiency (I am not a huge fan)
- Hack our way around it (removing the internal cbinded backend containing the preprocesses
userows after$.train_task(), then rbind its data with the preprocessed test rows and$cbind()it again to the task) (I am also not a huge fan) - Change the
$.train_task()method everywhere (maybe (?)) - Somehow not
$rbind()thetestandusetasks, to avoid the problem (I think this should be considered)