sl3 icon indicating copy to clipboard operation
sl3 copied to clipboard

Unify internal structure for predictions

Open osofr opened this issue 7 years ago • 5 comments

  • Currently using a matrix to store all predictions.
  • Should switch to data.table, with one column per model.
  • The column can be of arbitrary type (i.e., a list)) or numeric, depending on the regression problem (binary class / regression or multinomial classif.)

osofr avatar Aug 09 '17 01:08 osofr

I think this works for learners where the goal is predicting an outcome. However, it's not clear how this would work for "learners" where the goal is something else, for instance SL_Screener currently generates a matrix of screened covariates when predict() is called. This violates the "one column per model" paradigm. I think we need to allow learners to generate more than one column in the data table. We can still support stacking learners that generate arbitrary numbers of columns with cbind.

jeremyrcoyle avatar Aug 09 '17 01:08 jeremyrcoyle

This relates to currently enforced check: https://github.com/jeremyrcoyle/sl3/blob/master/tests/testthat/test_all_learners.R#L20

Can we switch that to test_that("Learner can generate training set predictions",expect_equal(nrow(train_preds),nrow(task$X))) for all "standard" learners? I.e., those that actually generate a) a single set of predictions of the new outcome (single column data.table) or b) several sets of predictions from different models, stored as data.table with several columns.

I still think that the best thing for screeners is to generate a predicted mutated design matrix, rather then the names of the selected covariates. In that case PCA can be used interchangeably with screeners, always generating the same types of the output.

Thoughts?

osofr avatar Aug 11 '17 17:08 osofr

By making it nrow instead of length, you're enforcing that predict returns something that's matrix shaped. Is that what you want? This seems like it might break user expectations, as when we implement an s3 predict.Learner method, it will return matricies/data.tables even from things like Learner_glm, where a user might instead expect something vector shaped.

Alternatively, we could support both learners that return vectors and learners that return matricies by using sl3::safe_dim(x)[1]

jeremyrcoyle avatar Aug 11 '17 18:08 jeremyrcoyle

Either way is fine with me. It seems like going for one prediction data type is always a safer and easier to maintain solution (I suggest that prediction always returns a data.table). We could also use your suggested approach for time being.

I would prefer to stay away from matrices as much as possible, unless we absolutely have to use them. When we do need to use, I think we should just perform the on the fly conversion as.matrix().

osofr avatar Aug 11 '17 18:08 osofr

okay, it sounds like the easiest thing is to change it to sl3::safe_dim(x)[1] for now, and then work on enforcing prediction types as a later effort

jeremyrcoyle avatar Aug 11 '17 18:08 jeremyrcoyle