Request: make predictions possible after using pool()
When using mice multiple imputation to generate multiple imputed sets, running a model on each of them (such as elastic-net from the glmnet package), and then pooling the estimates into a single set, it's not possible to use predict.glmnet() (or predict()) on the object that comes out of mice::pool(). The prediction functions expect a fitted model object, while mice::pool() returns an object of class mipo, which stands for 'multiple imputation pooled outcome'.
Is it possible to add the ability to make predictions from the return object of mice::pool()?
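For illustration, a minimal sketch of the situation, with lm() standing in for the glmnet fit (the data set and model here are placeholders, not from the original report):
library(mice)
imp    <- mice(nhanes, m = 5, print = FALSE)   # generate the multiply imputed sets
fits   <- with(imp, lm(bmi ~ age + chl))       # fit the model in each imputed set
pooled <- pool(fits)                           # pooled estimates, an object of class mipo
# predict(pooled)                              # fails: there is no predict() method for mipo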
There are actually multiple ways to create predictions from multiply imputed data. See Obtaining Predictions from Models Fit to Multiply Imputed Data for a comparison of two methods. I tend to deviate from Miles' recommendation, and would generally prefer Predict then combine because that corresponds to Rubin's rules and accounts for nonlinearities at the proper place. Also see the discussion by Wood et al., who highlight the decisions to be made when accounting for model optimism.
It should be relatively straightforward to implement any of these approaches in mice.
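To make the distinction concrete, here is a hedged illustration with a logistic model (the nhanes2 data, the model, and the new case are illustrative choices, not from the thread): with a nonlinear link, averaging predictions is not the same as predicting from averaged coefficients.
library(mice)
imp  <- mice(nhanes2, m = 5, print = FALSE)
fits <- getfit(with(imp, glm(hyp ~ bmi + chl, family = binomial)))
newx <- data.frame(bmi = 25, chl = 200)        # a single, fully observed new case

# (a) Predict then combine: predict in each fit, then average on the response scale
p_then_c <- mean(sapply(fits, predict, newdata = newx, type = "response"))

# (b) Combine then predict: average the coefficients, then apply the inverse link once
beta     <- rowMeans(sapply(fits, coef))
c_then_p <- plogis(drop(cbind(1, as.matrix(newx)) %*% beta))

c(p_then_c, c_then_p)  # the two differ because the inverse link is nonlinear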
Just want to add that this feature would be quite useful. I'm a bit surprised by the recommendations in the Miles paper; it doesn't really show the regimes where combine and predict would break down substantially.
Any word on this feature? Would be very helpful!
I have done some thinking on mice prediction workflows. Here are two possibilities, one involving only a train data set, another involving both a train and a test data set.
A
One data set: train, $n$ rows, $m$ imputations:
- Impute the missing data;
- Fit the complete-data model to each imputed data set;
- Obtain predictions per imputed data set;
- Pool predictions and prediction errors;
- Bind the result to train.
1. Standard mice: imp <- mice(train, …)
2. Standard complete-data analysis: fit <- with(imp, mymodel, …)
3. Calculate predictions: predlist <- lapply(fit$analyses, broom::augment.xxx, se.fit = TRUE, …). We may need the data argument for some procedures. Perhaps use map().
4. Apply Rubin's rules to pool predictions: predmat <- pool(predlist, …), which results in an n × 3 matrix: .fitted, .residuals, .se.fit.
5. Bind to data: ready <- bind_cols(train, predmat)
- Supports all broom::augment.xxx() functions.
- Steps 3 and 4 can jointly form an augment.mira() function, a kind of meta-augment (see the sketch below).
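A hand-rolled sketch of steps 1-5 for lm(), using broom::augment() and pool.scalar(); the data set and model are placeholders (note that current broom spells the argument se_fit):
library(mice)
library(broom)
library(dplyr)

train <- nhanes                                   # stand-in for the train data
imp   <- mice(train, m = 5, print = FALSE)        # step 1: impute
fit   <- with(imp, lm(bmi ~ age + chl))           # step 2: complete-data analysis

# step 3: predictions per imputed data set (.fitted and .se.fit columns)
predlist <- lapply(getfit(fit), augment, se_fit = TRUE)

# step 4: pool predictions and their standard errors row by row (Rubin's rules)
Q <- sapply(predlist, `[[`, ".fitted")
U <- sapply(predlist, `[[`, ".se.fit")^2
predmat <- t(apply(cbind(Q, U), 1, function(z) {
  m <- ncol(Q)
  p <- pool.scalar(z[1:m], z[-(1:m)], n = nrow(train))
  c(.fitted = p$qbar, .se.fit = sqrt(p$t))
}))

# step 5: bind the pooled predictions to the (incomplete) train data
ready <- bind_cols(train, as.data.frame(predmat))
head(ready)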
—
B
Two data sets: train ($n_0$ rows) and test ($n_1$ rows), $m$ imputations.
- Estimate the imputation model on train only;
- Impute the missing data in both train and test;
- Fit the complete-data model to each imputed data set using train only;
- Obtain predictions per imputed data set in test only;
- Pool predictions and prediction errors;
- Bind the result to test.
1. Standard mice: imp.train <- mice(train, …)
2. Impute test data: imp.test <- mice.mids(imp.train, newdata = test)
3. fit <- with(imp.test, mymodel, …)
4. Apply steps A3 and A4
5. Bind to data: ready <- bind_cols(test, predmat)
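A rough sketch of B along these lines (data set and model are placeholders; as a variant of step 2, this uses the ignore argument of mice() to estimate the imputation model on the train rows only, which should be equivalent in spirit to mice.mids() with newdata):
library(mice)
set.seed(123)
idx   <- sample(nrow(nhanes), 18)
train <- nhanes[idx, ]
test  <- nhanes[-idx, ]

# steps 1-2: estimate the imputation model on train only, impute train and test together
alldat  <- rbind(train, test)
imp.all <- mice(alldat, m = 5, print = FALSE,
                ignore = rep(c(FALSE, TRUE), c(nrow(train), nrow(test))))

# steps 3-4: fit on the train rows of each imputed set, predict for the test rows, average
pred <- rowMeans(sapply(seq_len(imp.all$m), function(i) {
  comp <- complete(imp.all, i)
  f    <- lm(bmi ~ age + chl, data = comp[1:nrow(train), ])
  predict(f, newdata = comp[-(1:nrow(train)), ])
}))

# step 5: bind to test (standard errors omitted; pool them with pool.scalar() as in A)
ready <- cbind(test, pred)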
—
A and B are extreme scenarios, and obviously, we can mix aspects of both.
What I would like to know is whether A or B as described would be useful for applications. Suggestions for alternatives and tweaks are welcome.
I came to request exactly option B. My application is to use MICE as part of a processing pipeline for predictive models, where the standard is to have a true hold-out set. In deployment I also don't want to update the estimation - just apply it (including the imputation model) to the example at hand. It's kind of complicated with predictive mean matching because (I think) the complete training dataset is needed at inference time to select the replacements at each step. Exporting the model to something lightweight would be huge.
I'd be very happy with A.
Dear all, I'd like to know the current status of this issue. In fact, I only need the pooled predicted values, not their standard errors, so I suppose averaging the predictions would be enough. If mice does not provide that yet, would this script be sensible (13 imputations)?:
# fit the analysis model in each of the 13 imputed data sets
fit <- with(impsppb, glm(compo ~ edad + sexo + fumarw2, family = quasibinomial))

# extract the id variable (hi1) and the fitted values from each imputation
for (i in 1:13) {
  fi <- data.frame(fit[["analyses"]][[i]][["data"]][["hi1"]],
                   fit[["analyses"]][[i]][["fitted.values"]])
  names(fi) <- c("hi1", paste0("pred", i))
  assign(paste0("fit", i), fi)
}

# merge the 13 prediction sets by id and average the predictions
fittot <- Reduce(function(x, y) merge(x = x, y = y, by = "hi1"),
                 list(fit1, fit2, fit3, fit4, fit5, fit6, fit7, fit8,
                      fit9, fit10, fit11, fit12, fit13))
fittot$predef <- rowMeans(fittot[, 2:14], na.rm = TRUE)
fittot <- fittot[c(1, 15)]
Thank you.
Option A would be spectacular. It's exactly what I'd want to teach to my students.
Another vote for Option A!! I came looking for exactly that.
For all other fans of Option A, here's a blog post showing how you can do it by hand: https://solomonkurz.netlify.app/post/2021-10-21-if-you-fit-a-model-with-multiply-imputed-data-you-can-still-plot-the-line/
🤗 you are my favorite person @ASKurz
Cheers! 🍻
Thank you so much @ASKurz, just what I needed
The script below uses the @ASKurz example in a more mice-like way. It is essentially Rubin's rules applied to model predictions instead of model parameters.
library(mice)
library(dplyr)

# generate data
set.seed(201)
b_small <- brandsma %>%
  filter(!complete.cases(.)) %>%
  slice_sample(n = 50) %>%
  select(ses, iqv, iqp)

# impute & analyse as usual
imp <- mice(b_small, seed = 540, m = 10, method = "norm", print = FALSE)
fit <- with(imp, lm(iqv ~ 1 + ses))

# obtain predictions Q and prediction variance U
predm <- lapply(getfit(fit), predict, se.fit = TRUE)
Q <- sapply(predm, `[[`, "fit")
U <- sapply(predm, `[[`, "se.fit")^2
dfcom <- predm[[1]]$df

# pool predictions
pred <- matrix(NA, nrow = nrow(Q), ncol = 3,
               dimnames = list(NULL, c("fit", "se.fit", "df")))
for (i in 1:nrow(Q)) {
  pi <- pool.scalar(Q[i, ], U[i, ], n = dfcom + 1)
  pred[i, 1] <- pi[["qbar"]]
  pred[i, 2] <- sqrt(pi[["t"]])
  pred[i, 3] <- pi[["df"]]
}
Perhaps we could evolve this into a predict.mira() function.
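A rough sketch of what such a helper could look like, written here as a standalone function (predict_mira is a hypothetical name, not an existing mice function; it just wraps the pooling loop above):
# hypothetical helper (not part of mice): apply Rubin's rules to predict() output
# from every analysis stored in a mira object
predict_mira <- function(object, dfcom = NULL, ...) {
  predm <- lapply(getfit(object), predict, se.fit = TRUE, ...)
  Q <- sapply(predm, `[[`, "fit")
  U <- sapply(predm, `[[`, "se.fit")^2
  if (is.null(dfcom)) dfcom <- predm[[1]]$df   # works for lm(); supply dfcom for glm() etc.
  out <- t(apply(cbind(Q, U), 1, function(z) {
    m <- ncol(Q)
    p <- pool.scalar(z[1:m], z[-(1:m)], n = dfcom + 1)
    c(fit = p[["qbar"]], se.fit = sqrt(p[["t"]]), df = p[["df"]])
  }))
  as.data.frame(out)
}

# e.g. head(predict_mira(fit)) should reproduce the pooled matrix computed above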
FWIW, predm[[1]] doesn't have a df object for glm, so dfcom ends up being NULL. It probably makes sense to get it from getfit(fit)[[1]]$df.null, assuming the null df is the correct one. Apologies if I am missing something obvious.
@markdanese Thanks for flagging this. The code is an example. It works for lm(), but needs tweaking for other models.
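For what it's worth, one possible tweak for a binomial glm() (my own sketch, reusing imp from the code above): pool on the link scale and back-transform, taking the complete-data degrees of freedom from the residual df of one analysis.
# sketch for a binomial glm(); the outcome iqv > 0 is purely illustrative
fit2   <- with(imp, glm(iqv > 0 ~ ses, family = binomial))
predm2 <- lapply(getfit(fit2), predict, type = "link", se.fit = TRUE)
Q2     <- sapply(predm2, `[[`, "fit")
U2     <- sapply(predm2, `[[`, "se.fit")^2
dfcom2 <- df.residual(getfit(fit2)[[1]])        # predict.glm() returns no df component

pred2 <- t(apply(cbind(Q2, U2), 1, function(z) {
  m <- ncol(Q2)
  p <- pool.scalar(z[1:m], z[-(1:m)], n = dfcom2 + 1)
  # back-transform the pooled linear predictor and its interval to probabilities
  c(fit = plogis(p[["qbar"]]),
    lwr = plogis(p[["qbar"]] - qt(0.975, p[["df"]]) * sqrt(p[["t"]])),
    upr = plogis(p[["qbar"]] + qt(0.975, p[["df"]]) * sqrt(p[["t"]])))
}))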
Perhaps we could evolve this into a predict.mira() function.
Would really appreciate that implementation!