mlr3pipelines
PipeOpImputeOOR behaviour for missing only during predict
Basically https://github.com/mlr-org/mlr3pipelines/issues/677 again.
Currently we naively turn NA into the .__MISSING__ level for factor columns. If values are missing only during prediction, this introduces a new factor level. In the past this generated a warning, but it now raises an error, so we have to change this.
We should introduce a new hyperparameter "create.empty.missing.level":
- if FALSE (default), we do not introduce empty factor levels during training, so a factor column with no missing values in train is left unchanged (i.e. current behaviour). If this column then has missings during predict, these are not imputed (new behaviour, but mlr3 will now always throw an error in this case if we don't do it like this). This is almost like doing po("fixfactors") after po("imputeoor"), except that we don't touch empty factor levels if they were there to begin with.
- if TRUE, we introduce the .__MISSING__ level even for factor cols that do not have missing values during train. Then, if these cols have missings in predict, there is no error. Some learners may have trouble with factor cols that have empty levels in train; we therefore don't do this by default so we don't break existing scripts. (Although realistically, this would have been a nice default, since create.empty.missing.level can also be simulated using selector_missing().)
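To make the two settings concrete, here is a minimal base-R sketch of the intended semantics for a single factor column. It is not the PipeOpImputeOOR implementation: the helper names train_levels and predict_impute and the argument create_empty_missing_level are made up for illustration only.

```r
# Sketch of the proposed semantics in base R (hypothetical helpers,
# not the actual mlr3pipelines code).
MISSING <- ".__MISSING__"

# "Train" step: decide which levels the imputed column will have.
# The missing level is added if NAs were seen, or if the (proposed)
# flag asks for it even when no NAs are present.
train_levels <- function(x, create_empty_missing_level = FALSE) {
  lv <- levels(x)
  if (anyNA(x) || create_empty_missing_level) {
    lv <- union(lv, MISSING)
  }
  lv
}

# "Predict" step: impute NA only if the missing level exists from training;
# otherwise leave NAs alone so that no new factor level appears.
predict_impute <- function(x, lv) {
  out <- as.character(x)
  if (MISSING %in% lv) {
    out[is.na(out)] <- MISSING
  }
  factor(out, levels = lv)
}

# Training data without missings, prediction data with one missing value.
train <- factor(c("a", "b", "a"))
newdata <- factor(c("a", NA, "b"), levels = c("a", "b"))

# FALSE (default): levels unchanged, the NA in predict stays NA.
pred_default <- predict_impute(newdata, train_levels(train))

# TRUE: empty level created in train, so the NA in predict can be imputed.
pred_empty <- predict_impute(
  newdata,
  train_levels(train, create_empty_missing_level = TRUE)
)
```

In the default case the NA survives and has to be handled by a later imputation step; in the TRUE case the column carries an empty level during training, which is the part some learners may not like.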
This means that if there are missings in factor cols during predict where no missings were present during train, these are not imputed by default, so one has to add another impute PO if this is a problem. However, this was pretty much already the case, since the old behaviour was to introduce new levels during predict, which would have been a problem for almost every learner; so existing code that anticipates having NAs in predict that were not in train probably uses po("fixfactors").