caret
Imputing mixed numeric/categorical data within train preProc?
Is it possible to impute mixed numeric/categorical data within train's preProc argument? I want to impute within train's cross-validation, thereby accounting for how uncertainty in the imputations affects the estimate of generalization error.
The ?preProcess help page suggests it is not possible to impute categorical variables:
x : a matrix or data frame. Non-numeric predictors are allowed but will be ignored.
However, the bagImpute method can handle mixed data, in theory. The following code runs, but I am not sure whether it is actually imputing the missing factor value or simply removing patients with missing factor values:
library(caret)
#> Loading required package: ggplot2
#> Loading required package: lattice
data(iris)
nrow(iris)
#> [1] 150
iris.miss <- iris
iris.miss[1, 'Species'] <- NA
iris.miss[2, 'Petal.Length'] <- NA
set.seed(1)
fit <- train(
  Sepal.Length ~ .,
  data = iris.miss,
  method = 'lm',
  preProc = 'bagImpute',
  na.action = na.pass
)
fit
#> Linear Regression
#>
#> 150 samples
#> 4 predictor
#>
#> Pre-processing: bagged tree imputation (5)
#> Resampling: Bootstrapped (25 reps)
#> Summary of sample sizes: 150, 150, 150, 150, 150, 150, ...
#> Resampling results:
#>
#> RMSE Rsquared MAE
#> 0.3176759 0.8587222 0.2604171
#>
#> Tuning parameter 'intercept' was held constant at a value of TRUE
Notice that the printed fit reports all 150 samples, which suggests the missing factor value was imputed. However, I suspect that the patient with the missing factor is simply being dropped from the model rather than imputed. Is that the case?
Created on 2023-07-23 by the reprex package (v2.0.1)
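A more direct check than reading the printed fit might look like the sketch below. It rests on an assumption I have not confirmed from the caret source: that train's formula interface expands Species into dummy columns (as model.matrix does) before preProc is applied, so bagImpute only ever sees numeric columns. The object names x, pp and x.imp are just illustrative, and bagImpute needs the ipred package installed.

```r
library(caret)

iris.miss <- iris
iris.miss[1, "Species"] <- NA
iris.miss[2, "Petal.Length"] <- NA

# Build the numeric design matrix while keeping rows that contain NA
# (assumption: this mimics what train's formula method does internally)
mf <- model.frame(Sepal.Length ~ ., data = iris.miss, na.action = na.pass)
x <- as.data.frame(model.matrix(Sepal.Length ~ ., data = mf)[, -1])

nrow(x)    # number of rows kept after dummy expansion
x[1:2, ]   # rows 1 and 2 before imputation

# Fit the bagged-tree imputation models, then apply them to the same data
set.seed(1)
pp <- preProcess(x, method = "bagImpute")
x.imp <- predict(pp, x)

anyNA(x.imp)   # are any missing values left?
x.imp[1:2, ]   # rows 1 and 2 after imputation
```

If row 1 is still present in x.imp and its Species dummy columns are filled in, then the dummy variables are being imputed rather than the row being dropped. Note that this would impute the dummy columns as numeric values, which is not the same as imputing the factor itself.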