
Request: make predictions possible after using pool()

oren0e opened this issue 6 years ago • 28 comments

When using mice to generate multiple imputed data sets, fitting a model on each of them (such as elastic-net from the glmnet package), and then pooling the estimates into a single set, it is not possible to use predict.glmnet() (or predict()) on the object that comes out of mice::pool(). The prediction functions expect a fitted model object, while mice::pool() returns

An object of class mipo, which stands for 'multiple imputation pooled outcome'.

Is it possible to add the ability to make predictions from the object returned by mice::pool()?

oren0e avatar Apr 08 '18 10:04 oren0e

There are actually multiple ways to create predictions from multiply imputed data. See Obtaining Predictions from Models Fit to Multiply Imputed Data for a comparison of two methods. I tend to deviate from Miles' recommendation, and would generally prefer Predict then combine because that corresponds to Rubin's rules and accounts for nonlinearities at the proper place.

Also see the discussion by Wood et al., who highlight the decisions to be made when accounting for model optimism.

It should be relatively straightforward to implement any of these approaches in mice.
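For concreteness, the two orderings compared in the Miles paper can be sketched as follows. This is a minimal sketch only, using the nhanes data shipped with mice and a simple lm() analysis; it is not a definitive implementation of either method:

library(mice)

imp <- mice(nhanes, m = 5, print = FALSE, seed = 1)
fit <- with(imp, lm(chl ~ bmi))

# "Combine then predict": pool the coefficients first, then predict once
pooled_coefs <- summary(pool(fit))$estimate
X <- cbind(1, complete(imp, 1)$bmi)
pred_cp <- as.vector(X %*% pooled_coefs)

# "Predict then combine": predict within each imputed data set, then average
pred_pc <- rowMeans(sapply(getfit(fit), predict))

For a linear model the two orderings agree when the predictors are identical across imputations; they diverge once the predictors themselves are imputed or the model is nonlinear, which is why Predict then combine handles nonlinearities at the proper place.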

stefvanbuuren avatar Apr 09 '18 08:04 stefvanbuuren

Just want to add that this feature would be quite useful. I'm a bit surprised by the recommendations in the Miles paper; it doesn't really show the regimes where combine-then-predict would break down substantially.

mnarayan avatar Mar 18 '19 16:03 mnarayan

Any word on this feature? Would be very helpful!

SolomonMg avatar Aug 21 '20 18:08 SolomonMg

I have done some thinking on mice prediction workflows. Here are two possibilities: one involving only a training data set, another involving both a training and a test data set.

A

One data set: train, $n$ rows, $m$ imputations:

  • Impute the missing data;
  • Fit the complete-data model to each imputed data set;
  • Obtain predictions per imputed data set;
  • Pool predictions and prediction errors;
  • Bind the result to train.
1. Standard mice: imp <- mice(train, …)
2. Standard complete-data analysis: fit <- with(imp, mymodel, …)
3. Calculate predictions: predlist <- lapply(fit$analyses, broom::augment.xxx, se.fit = TRUE, …). We may need the data argument for some procedures; perhaps use map().
4. Apply Rubin's rules to pool predictions: predmat <- pool(predlist, …), which results in an n × 3 matrix with columns .fitted, .residuals, .se.fit
5. Bind to data: ready <- bind_cols(train, predmat)
  • Supports all broom::augment.xxx() functions.
  • Steps 3 and 4 can jointly form an augment.mira() function, a kind of meta-augment.
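A minimal sketch of steps A1–A5 for lm(), using broom::augment() (the argument is named se_fit in current broom). The pooling in A4 is shown as a simple per-row average for brevity; proper pooling of the standard errors would apply Rubin's rules, e.g. via pool.scalar():

library(mice)
library(broom)
library(dplyr)

imp <- mice(nhanes, m = 5, print = FALSE, seed = 1)        # A1
fit <- with(imp, lm(chl ~ bmi))                            # A2
predlist <- lapply(getfit(fit), augment, se_fit = TRUE)    # A3
predmat <- data.frame(                                     # A4 (simple average,
  .fitted = rowMeans(sapply(predlist, `[[`, ".fitted")),   #     not Rubin's rules)
  .se.fit = rowMeans(sapply(predlist, `[[`, ".se.fit"))
)
ready <- bind_cols(nhanes, predmat)                        # A5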

B

Two data sets: train ($n_0$ rows) and test ($n_1$ rows), $m$ imputations.

  • Estimate the imputation model on train only;
  • Impute the missing data in both train and test;
  • Fit the complete-data model to each imputed data set using train only;
  • Obtain predictions per imputed data set in test only;
  • Pool predictions and prediction errors;
  • Bind the result to test.
1. Standard mice: imp.train <- mice(train, …)
2. Impute test data: imp.test <- mice.mids(imp.train, newdata = test)
3. fit <- with(imp.test, mymodel, …)
4. Apply steps A3 and A4
5. Bind to data: ready <- bind_cols(test, predmat)
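A sketch of B1–B5, assuming a mice version in which mice.mids() accepts a newdata argument (available since mice 3.12), with lm() standing in for mymodel and the same simplified averaging as in A; the train/test split of nhanes is purely illustrative:

library(mice)

set.seed(1)
idx <- sample(nrow(nhanes), 18)
train <- nhanes[idx, ]
test <- nhanes[-idx, ]

imp.train <- mice(train, m = 5, print = FALSE)             # B1
imp.test <- mice.mids(imp.train, newdata = test)           # B2
fit <- with(imp.train, lm(chl ~ bmi))                      # B3: fit on train only
# B4: predict each train fit on the matching imputed test set, then average
pred <- rowMeans(sapply(seq_len(imp.train$m), function(i)
  predict(getfit(fit)[[i]], newdata = complete(imp.test, i))))
ready <- cbind(test, pred)                                 # B5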

A and B are extreme scenarios, and obviously, we can mix aspects of both.

What I would like to know is whether A or B as described would be useful for applications. Suggestions for alternatives and tweaks are welcome.

stefvanbuuren avatar Oct 27 '20 08:10 stefvanbuuren

I came to request exactly option B. My application is to use mice as part of a processing pipeline for predictive models, where the standard is to have a true hold-out set. In deployment I also don't want to update the estimation, just apply it (including the imputation model) to the example at hand. It's somewhat complicated with predictive mean matching because (I think) the complete training data set is needed at inference time to select the replacements at each step. Exporting the model to something lightweight would be huge.

cryanking avatar Feb 15 '21 22:02 cryanking

I'd be very happy with A.

JonNDCN avatar May 28 '21 13:05 JonNDCN

Dear all, I'd like to know the current status of this issue. In fact, I only need the pooled predicted values, not their standard errors, so I suppose averaging the predictions would be enough. If mice does not provide that yet, would this script be sensible (13 imputations)?:

fit <- with(impsppb, glm(compo ~ edad + sexo + fumarw2, family = quasibinomial))
for (i in 1:13) {
  assign(paste0("fit", i),
         data.frame(fit[["analyses"]][[i]][["data"]][["hi1"]],
                    fit[["analyses"]][[i]][["fitted.values"]]))
  assign(paste0("fit", i),
         `names<-`(get(paste0("fit", i)), c("hi1", "pred")))
}
fittot <- Reduce(function(x, y) merge(x = x, y = y, by = "hi1"),
                 list(fit1, fit2, fit3, fit4, fit5, fit6, fit7,
                      fit8, fit9, fit10, fit11, fit12, fit13))
fittot$predef <- rowMeans(fittot[, c(2:14)], na.rm = TRUE)
fittot <- fittot[c(1, 15)]
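If only the averaged predictions are needed, the assign/merge loop can be written more compactly. This sketch works against the same fit object and assumes every analysis keeps rows in the same order, so no merge key is required:

pred <- sapply(fit$analyses, fitted)   # n x 13 matrix of fitted values
fittot <- data.frame(
  hi1    = fit$analyses[[1]]$data$hi1,
  predef = rowMeans(pred)
)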

Thank you.

angelrodriguez2020 avatar May 31 '21 17:05 angelrodriguez2020

Option A would be spectacular. It's exactly what I'd want to teach to my students.

ASKurz avatar Oct 20 '21 16:10 ASKurz

Another vote for Option A!! I came looking for exactly that.

stjeandenise avatar Nov 11 '21 23:11 stjeandenise

For all other fans of Option A, here's a blog post showing how you can do it by hand: https://solomonkurz.netlify.app/post/2021-10-21-if-you-fit-a-model-with-multiply-imputed-data-you-can-still-plot-the-line/

ASKurz avatar Nov 12 '21 01:11 ASKurz

For all other fans of Option A, here's a blog post showing how you can do it by hand: https://solomonkurz.netlify.app/post/2021-10-21-if-you-fit-a-model-with-multiply-imputed-data-you-can-still-plot-the-line/

🤗 you are my favorite person @ASKurz

stjeandenise avatar Nov 12 '21 02:11 stjeandenise

Cheers! 🍻

ASKurz avatar Nov 12 '21 02:11 ASKurz

Thank you so much @ASKurz, just what I needed

JonNDCN avatar Nov 12 '21 14:11 JonNDCN

The script below uses the @ASKurz example in a more mice-like way. It is essentially Rubin's rules applied to model predictions instead of model parameters.

library(mice)
library(dplyr)

# generate data
set.seed(201)
b_small <-
  brandsma %>%
  filter(!complete.cases(.)) %>%
  slice_sample(n = 50) %>%
  select(ses, iqv, iqp)

# impute & analyse as usual
imp <- mice(b_small, seed = 540, m = 10, method = "norm", print = FALSE)
fit <- with(imp, lm(iqv ~ 1 + ses))

# obtain predictions Q and prediction variance U
predm <- lapply(getfit(fit), predict, se.fit = TRUE)
Q <- sapply(predm, `[[`, "fit")
U <- sapply(predm, `[[`, "se.fit")^2
dfcom <- predm[[1]]$df

# pool predictions
pred <- matrix(NA, nrow = nrow(Q), ncol = 3,
                 dimnames = list(NULL, c("fit", "se.fit", "df")))
for(i in 1:nrow(Q)) {
  pi <- pool.scalar(Q[i, ], U[i, ], n = dfcom + 1)
  pred[i, 1] <- pi[["qbar"]]
  pred[i, 2] <- sqrt(pi[["t"]])
  pred[i, 3] <- pi[["df"]]
}

Perhaps we could evolve this into a predict.mira() function.
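A hypothetical predict.mira() along those lines might simply wrap the loop above (this is not part of mice; the name, arguments, and defaults are illustrative only):

predict.mira <- function(object, dfcom = NULL, ...) {
  # predictions Q and prediction variances U per imputed data set
  predm <- lapply(getfit(object), predict, se.fit = TRUE, ...)
  Q <- sapply(predm, `[[`, "fit")
  U <- sapply(predm, `[[`, "se.fit")^2
  # df is available from predict.lm(); pass dfcom explicitly for other models
  if (is.null(dfcom)) dfcom <- predm[[1]]$df
  # pool each row with Rubin's rules
  t(sapply(seq_len(nrow(Q)), function(i) {
    pi <- pool.scalar(Q[i, ], U[i, ], n = dfcom + 1)
    c(fit = pi[["qbar"]], se.fit = sqrt(pi[["t"]]), df = pi[["df"]])
  }))
}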

stefvanbuuren avatar Nov 14 '21 12:11 stefvanbuuren

FWIW, predm[[1]] doesn't have a df element for glm(), so dfcom ends up being NULL. It probably makes sense to get it from getfit(fit)[[1]]$df.null, assuming the null df is the correct one. Apologies if I am missing something obvious.

markdanese avatar Jan 14 '22 20:01 markdanese

@markdanese Thanks for flagging this. The code is an example. It works for lm(), but needs tweaking for other models.

stefvanbuuren avatar Jan 15 '22 15:01 stefvanbuuren

Perhaps we could evolve this into a predict.mira() function.

Would really appreciate that implementation!

RStellato avatar Aug 29 '22 14:08 RStellato