Allow virtual lazy tensors as targets in classification and regression

Open · sebffischer opened this pull request 7 months ago · 2 comments

This PR adds an experimental feature that allows converting a torch::dataset to an mlr3::Task.

Essentially, the torch::dataset is converted to a data.table consisting only of lazy_tensor columns (including the target column). To make this compatible with the mlr3 API (measures etc.), a converter must be provided for the target column that turns the torch_tensor into the associated R type:

library(mlr3torch)

# Regression
x = torch_randn(100, 3)
beta = torch_randn(3, 1)
y = x$matmul(beta) + torch_randn(100, 1)
ds = tensor_dataset(
  x = x,
  y = y
)

task = as_task_regr(ds, target = "y", converter = list(y = as.numeric))

When accessing data from the task, the lazy_tensor columns for which a converter exists are materialize()d and the converter is applied, making the target look like a standard numeric().

task$head(2L)
#>             y             x
#>         <num> <lazy_tensor>
#> 1:  0.9878142     <tnsr[3]>
#> 2: -0.3822043     <tnsr[3]>

However, LearnerTorch bypasses this conversion and loads the target tensors directly (as defined by the tensor_dataset above) during training.

mlp = lrn("regr.mlp", batch_size = 100, epochs = 50)

rr = resample(task, mlp, rsmp("cv", folds = 3))
rr$aggregate(msr("regr.rmse"))
#> regr.rmse 
#>   1.19302

Because the individual batches can only be loaded as a whole, some data access becomes more expensive: e.g., task$truth(1:10) needs to load the full batches for these rows even though we are only interested in the target column.
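
As a sketch of what this looks like from the user's perspective (standard mlr3 API; the cost is hidden in the backend):

task$truth(1:10)  # returns 10 numeric targets, but also materializes x for these rows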

For this reason, some operations are disallowed, such as target transformations or adding new rows to the task:

glrn = as_learner(ppl("targettrafo", mlp))
glrn$train(task)
#> Error in check_lazy_tensors_backend(bs$b1, candidates, visited): A converter column ('y') from a DataBackendLazyTensors was presumably preprocessed by some PipeOp. This can cause inefficiencies and is therefore not allowed. If you want to preprocess them, please directly encode them as R types.
#> This happened PipeOp regr.mlp's $train()
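
The error message suggests encoding converter columns as R types before preprocessing them. A minimal sketch of that workaround, assuming task$data() returns the target already converted to numeric (as task$head() above indicates):

dt = task$data()                        # y arrives as a plain numeric column
task2 = as_task_regr(dt, target = "y")  # ordinary backend, no converter needed
glrn$train(task2)                       # target transformations work as usual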

Furthermore, converted columns are cached, as demonstrated below: on the second call to $head(), the dataset's counter is not incremented, i.e. $.getbatch() is not called again; the values are instead served from the cache.

ds = dataset(
  initialize = function() {
    self$x = torch_randn(100, 3)
    self$y = torch_randn(100, 1)
    self$counter = 0L
  },
  .getbatch = function(i) {
    self$counter = self$counter + 1L
    list(x = self$x[i, drop = FALSE], y = self$y[i, drop = FALSE])
  },
  .length = function() 100
)()

task = as_task_regr(ds, target = "y")

counter = ds$counter
task$head()
#>              y             x
#>          <num> <lazy_tensor>
#> 1:  1.91739988     <tnsr[3]>
#> 2:  0.99552888     <tnsr[3]>
#> 3: -0.03263215     <tnsr[3]>
#> 4:  1.66325033     <tnsr[3]>
#> 5: -0.22850810     <tnsr[3]>
#> 6: -0.47497058     <tnsr[3]>
print(ds$counter - counter)
#> [1] 1
counter = ds$counter
task$head()
#>              y             x
#>          <num> <lazy_tensor>
#> 1:  1.91739988     <tnsr[3]>
#> 2:  0.99552888     <tnsr[3]>
#> 3: -0.03263215     <tnsr[3]>
#> 4:  1.66325033     <tnsr[3]>
#> 5: -0.22850810     <tnsr[3]>
#> 6: -0.47497058     <tnsr[3]>
print(ds$counter - counter)
#> [1] 0

Created on 2025-04-17 with reprex v2.1.1

Internally, this works via the DataBackendLazyTensors (TODO: describe this).
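
A quick way to confirm which backend is in use (a sketch; the class name is taken from the error message above):

class(task$backend)  # expected to include "DataBackendLazyTensors"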

— sebffischer, Apr 16 '25

@tdhock the PR is WIP, but I was wondering whether you could give some feedback on whether this seems useful to you and whether the API is intuitive?

— sebffischer, Apr 17 '25

Hi, thanks for the invite to review. I would like to, but I probably won't have time until the end of April.

— tdhock, Apr 17 '25