mlr3pipelines

PipeOpImputeOOR behaviour for missing only during predict

Open mb706 opened this issue 7 months ago • 3 comments

Basically https://github.com/mlr-org/mlr3pipelines/issues/677 again.

Currently we naively turn NA into the .__MISSING__ level for factor columns. If things are only missing during prediction, this introduces a new factor level. In the past, this generated a warning, but now it gives an error, so we have to change this.

We should introduce a new hyperparameter "create.empty.missing.level":

  • if FALSE (default), we do not introduce empty factor levels during training, so if there is a factor column with no missing values in train, it is not changed (i.e. current behaviour). If this column now has missings during predict, these are not imputed (new behaviour, but mlr3 will now always throw an error if we don't do it like this in this case). This is almost like doing po("fixfactors") after po("imputeoor"), except that we don't touch empty factor levels if they were there to begin with.
  • if TRUE, we introduce the .__MISSING__ level even for factor cols that do not have missing values during train. Then, if these cols have missings in predict, there is no error. Some learners may have trouble with factor cols that have empty levels in train, so we don't do this by default, to avoid breaking existing scripts. (Realistically, though, this would have been a nice default, since create.empty.missing.level can also be simulated using selector_missing().)
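
The two settings can be illustrated with a rough base R sketch for a single factor column. This is only an emulation of the proposal: impute_oor_factor is a hypothetical helper written for this example, not part of mlr3pipelines, and the real PipeOp may end up implemented differently.

```r
# Hypothetical helper, not part of mlr3pipelines: emulates the proposed
# behaviour for a single factor column using base R only.
impute_oor_factor = function(train, predict, create.empty.missing.level = FALSE) {
  missing_level = ".__MISSING__"
  # The level is added if train has NAs, or if the user asks for it regardless.
  add_level = anyNA(train) || create.empty.missing.level
  impute = function(x) {
    if (!add_level) return(x)  # column left untouched; NAs in predict stay NA
    levels(x) = union(levels(x), missing_level)
    x[is.na(x)] = missing_level
    x
  }
  list(train = impute(train), predict = impute(predict))
}

train = factor(c("a", "b", "a"))                      # no missings in train
predict = factor(c("a", NA), levels = levels(train))  # missing only in predict

res_default = impute_oor_factor(train, predict)
anyNA(res_default$predict)  # TRUE: the NA is not imputed, no new level appears

res_empty = impute_oor_factor(train, predict, create.empty.missing.level = TRUE)
levels(res_empty$train)     # "a" "b" ".__MISSING__" (empty level in train)
anyNA(res_empty$predict)    # FALSE: the NA became ".__MISSING__"
```

With FALSE, train and predict keep identical levels and the NA passes through; with TRUE, both share the extra level, at the cost of an empty level in train.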

This means that if there are missings in factor cols during predict where no missings were present during train, these are not imputed by default, so one has to add another impute PO if this should be a problem. However, this was pretty much already the case: the old behaviour was to introduce new levels during predict, which would have been a problem for almost every learner, so existing code that anticipates having NAs in predict that were not in train probably uses po("fixfactors").
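
A minimal sketch of what "adding another impute PO" could look like. po("imputemode") is an existing mlr3pipelines imputer picked here purely for illustration, and it is an assumption that it fills NAs in columns that had no missings during train; any other imputer could take its place.

```r
library(mlr3pipelines)

# imputeoor handles columns that have missings in train; the second imputer
# is meant to catch NAs that show up only during predict (assumption: it
# fits an imputation model for every affected column, not just those with NAs).
graph = po("imputeoor") %>>% po("imputemode")
```

The resulting graph can then be chained in front of a learner as usual.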

mb706, Mar 24 '25 10:03