Request: make predictions possible after using pool()
When using mice multiple imputation to generate multiple imputed sets, running a model on each of them (such as elastic-net from the glmnet package), and then pooling the estimates into a single set, it's not possible to use predict.glmnet() (or predict()) on the object that comes out of mice::pool(). The prediction functions expect a fitted model object, while mice::pool() returns an object of class mipo, which stands for 'multiple imputation pooled outcome'.
Is it possible to add the ability to make predictions from the return object of mice::pool()?
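For illustration, a minimal sketch of the situation, with lm() standing in for the glmnet fit (the data set and model here are placeholders, not from the original report):
library(mice)
imp    <- mice(nhanes, m = 5, print = FALSE)   # generate the multiply imputed sets
fits   <- with(imp, lm(bmi ~ age + chl))       # fit the model in each imputed set
pooled <- pool(fits)                           # pooled estimates, an object of class mipo
# predict(pooled)                              # fails: there is no predict() method for mipo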
There are actually multiple ways to create predictions from multiply imputed data. See Obtaining Predictions from Models Fit to Multiply Imputed Data for a comparison of two methods. I tend to deviate from Miles' recommendation, and would generally prefer Predict then combine because that corresponds to Rubin's rules and accounts for nonlinearities at the proper place. Also see the discussion by Wood et al., who highlight the decisions to be made when accounting for model optimism.
It should be relatively straightforward to implement any of these approaches in mice.
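To make the distinction concrete, here is a hedged illustration with a logistic model (the nhanes2 data, the model, and the new case are illustrative choices, not from the thread): with a nonlinear link, averaging predictions is not the same as predicting from averaged coefficients.
library(mice)
imp  <- mice(nhanes2, m = 5, print = FALSE)
fits <- getfit(with(imp, glm(hyp ~ bmi + chl, family = binomial)))
newx <- data.frame(bmi = 25, chl = 200)        # a single, fully observed new case

# (a) Predict then combine: predict in each fit, then average on the response scale
p_then_c <- mean(sapply(fits, predict, newdata = newx, type = "response"))

# (b) Combine then predict: average the coefficients, then apply the inverse link once
beta     <- rowMeans(sapply(fits, coef))
c_then_p <- plogis(drop(cbind(1, as.matrix(newx)) %*% beta))

c(p_then_c, c_then_p)  # the two differ because the inverse link is nonlinear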
Just want to add that this feature would be quite useful. I'm a bit surprised by the recommendations in the Miles paper; it doesn't really show the regimes where combine and predict would break down substantially.
Any word on this feature? Would be very helpful!
I have done some thinking on mice prediction workflows. Here are two possibilities, one involving only a train data set, another involving both a train and a test data set.
A
One data set: train, $n$ rows, $m$ imputations:
- Impute the missing data;
- Fit the complete-data model to each imputed data set;
- Obtain predictions per imputed data set;
- Pool predictions and prediction errors;
- Bind the result to train.
1. Standard mice: imp <- mice(train, …)
2. Standard complete-data analysis: fit <- with(imp, mymodel, …)
3. Calculate predictions: predlist <- lapply(fit$analyses, broom::augment.xxx, se.fit = TRUE, …). We may need the data argument for some procedures. Perhaps use map().
4. Apply Rubin's rules to pool predictions: predmat <- pool(predlist, …), which results in an n × 3 matrix: .fitted, .residuals, .se.fit.
5. Bind to data: ready <- bind_cols(train, predmat)
- Supports all broom::augment.xxx() functions.
- Steps 3 and 4 can jointly form an augment.mira() function, a kind of meta-augment (see the sketch below).
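A hand-rolled sketch of steps 1-5 for lm(), using broom::augment() and pool.scalar(); the data set and model are placeholders (note that current broom spells the argument se_fit):
library(mice)
library(broom)
library(dplyr)

train <- nhanes                                   # stand-in for the train data
imp   <- mice(train, m = 5, print = FALSE)        # step 1: impute
fit   <- with(imp, lm(bmi ~ age + chl))           # step 2: complete-data analysis

# step 3: predictions per imputed data set (.fitted and .se.fit columns)
predlist <- lapply(getfit(fit), augment, se_fit = TRUE)

# step 4: pool predictions and their standard errors row by row (Rubin's rules)
Q <- sapply(predlist, `[[`, ".fitted")
U <- sapply(predlist, `[[`, ".se.fit")^2
predmat <- t(apply(cbind(Q, U), 1, function(z) {
  m <- ncol(Q)
  p <- pool.scalar(z[1:m], z[-(1:m)], n = nrow(train))
  c(.fitted = p$qbar, .se.fit = sqrt(p$t))
}))

# step 5: bind the pooled predictions to the (incomplete) train data
ready <- bind_cols(train, as.data.frame(predmat))
head(ready)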
—
B
Two data sets: train ($n_0$ rows) and test ($n_1$ rows), $m$ imputations.
- Estimate the imputation model on train only;
- Impute the missing data in both train and test;
- Fit the complete-data model to each imputed data set using train only;
- Obtain predictions per imputed data set in test only;
- Pool predictions and prediction errors;
- Bind the result to test.
1. Standard mice: imp.train <- mice(train, …)
2. Impute test data: imp.test <- mice.mids(imp.train, newdata = test)
3. fit <- with(imp.test, mymodel, …)
4. Apply steps A3 and A4
5. Bind to data: ready <- bind_cols(test, predmat)
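A rough sketch of B along these lines (data set and model are placeholders; as a variant of step 2, this uses the ignore argument of mice() to estimate the imputation model on the train rows only, which should be equivalent in spirit to mice.mids() with newdata):
library(mice)
set.seed(123)
idx   <- sample(nrow(nhanes), 18)
train <- nhanes[idx, ]
test  <- nhanes[-idx, ]

# steps 1-2: estimate the imputation model on train only, impute train and test together
alldat  <- rbind(train, test)
imp.all <- mice(alldat, m = 5, print = FALSE,
                ignore = rep(c(FALSE, TRUE), c(nrow(train), nrow(test))))

# steps 3-4: fit on the train rows of each imputed set, predict for the test rows, average
pred <- rowMeans(sapply(seq_len(imp.all$m), function(i) {
  comp <- complete(imp.all, i)
  f    <- lm(bmi ~ age + chl, data = comp[1:nrow(train), ])
  predict(f, newdata = comp[-(1:nrow(train)), ])
}))

# step 5: bind to test (standard errors omitted; pool them with pool.scalar() as in A)
ready <- cbind(test, pred)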
—
A and B are extreme scenarios, and obviously, we can mix aspects of both.
What I would like to know is whether A or B as described would be useful for applications. Suggestions for alternatives and tweaks are welcome.
I came to request exactly option B. My application is to use MICE as part of a processing pipeline for predictive models, where the standard is to have a true hold-out set. In deployment I also don't want to update the estimation - just apply it (including the imputation model) to the example at hand. It's kind of complicated with predictive mean matching because (I think) the complete training dataset is needed at inference time to select the replacements at each step. Exporting the model to something lightweight would be huge.
I'd be very happy with A.
Dear all, I'd like to know the current status of this issue. In fact, I only need the pooled predicted values, not their standard errors, so I suppose averaging the predictions would be enough. If mice does not provide that yet, would this script be sensible (13 imputations)?:
# fit the analysis model in each of the 13 imputed data sets
fit <- with(impsppb, glm(compo ~ edad + sexo + fumarw2, family = quasibinomial))

# extract the id variable (hi1) and the fitted values from each imputation
for (i in 1:13) {
  fi <- data.frame(fit[["analyses"]][[i]][["data"]][["hi1"]],
                   fit[["analyses"]][[i]][["fitted.values"]])
  names(fi) <- c("hi1", paste0("pred", i))
  assign(paste0("fit", i), fi)
}

# merge the 13 prediction sets by id and average the predictions
fittot <- Reduce(function(x, y) merge(x = x, y = y, by = "hi1"),
                 list(fit1, fit2, fit3, fit4, fit5, fit6, fit7, fit8,
                      fit9, fit10, fit11, fit12, fit13))
fittot$predef <- rowMeans(fittot[, 2:14], na.rm = TRUE)
fittot <- fittot[c(1, 15)]
Thank you.
Option A would be spectacular. It's exactly what I'd want to teach to my students.
Another vote for Option A!! I came looking for exactly that.
For all other fans of Option A, here's a blog post showing how you can do it by hand: https://solomonkurz.netlify.app/post/2021-10-21-if-you-fit-a-model-with-multiply-imputed-data-you-can-still-plot-the-line/
🤗 you are my favorite person @ASKurz
Cheers! 🍻
Thank you so much @ASKurz, just what I needed
The script below uses the @ASKurz example in a more mice-like way. It is essentially Rubin's rules applied to model predictions instead of model parameters.
library(mice)
library(dplyr)

# generate data
set.seed(201)
b_small <- brandsma %>%
  filter(!complete.cases(.)) %>%
  slice_sample(n = 50) %>%
  select(ses, iqv, iqp)

# impute & analyse as usual
imp <- mice(b_small, seed = 540, m = 10, method = "norm", print = FALSE)
fit <- with(imp, lm(iqv ~ 1 + ses))

# obtain predictions Q and prediction variance U
predm <- lapply(getfit(fit), predict, se.fit = TRUE)
Q <- sapply(predm, `[[`, "fit")
U <- sapply(predm, `[[`, "se.fit")^2
dfcom <- predm[[1]]$df

# pool predictions
pred <- matrix(NA, nrow = nrow(Q), ncol = 3,
               dimnames = list(NULL, c("fit", "se.fit", "df")))
for (i in 1:nrow(Q)) {
  pi <- pool.scalar(Q[i, ], U[i, ], n = dfcom + 1)
  pred[i, 1] <- pi[["qbar"]]
  pred[i, 2] <- sqrt(pi[["t"]])
  pred[i, 3] <- pi[["df"]]
}
Perhaps we could evolve this into a predict.mira() function.
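A rough sketch of what such a helper could look like, written here as a standalone function (predict_mira is a hypothetical name, not an existing mice function; it just wraps the pooling loop above):
# hypothetical helper (not part of mice): apply Rubin's rules to predict() output
# from every analysis stored in a mira object
predict_mira <- function(object, dfcom = NULL, ...) {
  predm <- lapply(getfit(object), predict, se.fit = TRUE, ...)
  Q <- sapply(predm, `[[`, "fit")
  U <- sapply(predm, `[[`, "se.fit")^2
  if (is.null(dfcom)) dfcom <- predm[[1]]$df   # works for lm(); supply dfcom for glm() etc.
  out <- t(apply(cbind(Q, U), 1, function(z) {
    m <- ncol(Q)
    p <- pool.scalar(z[1:m], z[-(1:m)], n = dfcom + 1)
    c(fit = p[["qbar"]], se.fit = sqrt(p[["t"]]), df = p[["df"]])
  }))
  as.data.frame(out)
}

# e.g. head(predict_mira(fit)) should reproduce the pooled matrix computed above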
FWIW, predm[[1]] doesn't have a df object for glm, so dfcom ends up being NULL. It probably makes sense to get it from getfit(fit)[[1]]$df.null, assuming the null df is the correct one. Apologies if I am missing something obvious.
@markdanese Thanks for flagging this. The code is an example. It works for lm(), but needs tweaking for other models.
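For what it's worth, one possible tweak for a binomial glm() (my own sketch, reusing imp from the code above): pool on the link scale and back-transform, taking the complete-data degrees of freedom from the residual df of one analysis.
# sketch for a binomial glm(); the outcome iqv > 0 is purely illustrative
fit2   <- with(imp, glm(iqv > 0 ~ ses, family = binomial))
predm2 <- lapply(getfit(fit2), predict, type = "link", se.fit = TRUE)
Q2     <- sapply(predm2, `[[`, "fit")
U2     <- sapply(predm2, `[[`, "se.fit")^2
dfcom2 <- df.residual(getfit(fit2)[[1]])        # predict.glm() returns no df component

pred2 <- t(apply(cbind(Q2, U2), 1, function(z) {
  m <- ncol(Q2)
  p <- pool.scalar(z[1:m], z[-(1:m)], n = dfcom2 + 1)
  # back-transform the pooled linear predictor and its interval to probabilities
  c(fit = plogis(p[["qbar"]]),
    lwr = plogis(p[["qbar"]] - qt(0.975, p[["df"]]) * sqrt(p[["t"]])),
    upr = plogis(p[["qbar"]] + qt(0.975, p[["df"]]) * sqrt(p[["t"]])))
}))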
Perhaps we could evolve this into a predict.mira() function.
Would really appreciate that implementation!