Predictions from machine learning models may not properly account for uncertainty
Hi all,
I was just reflecting on miceforest's imputation strategy, and I wonder if it may underestimate uncertainty in missing values.
In classical multiple imputation methods that use a Gaussian linear model, a prediction for the missing value is generated by (1) drawing a value beta_dot from the posterior distribution of the regression coefficients and a value sigma_dot from the posterior distribution of the observation noise, and then (2) drawing a prediction from N(X*beta_dot, sigma_dot). So the predictions that are generated account for noise in the data, and also for uncertainty in the model (the uncertainty in the estimated regression coefficients and in the estimated noise), assuming of course that such a linear model fits well enough. The idea is not to generate the 'best' prediction for a missing value, but to draw predictions from a distribution that reflects the uncertainty in the missing values. These predictions can subsequently be used for predictive mean matching. This is based on https://stefvanbuuren.name/fimd/how-to-generate-multiple-imputations.html#sec:meth3
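To make the two-step draw concrete, here is a minimal numpy sketch of it (roughly the algorithm described in the linked chapter; the function name and the small ridge term are my own):

```python
import numpy as np

def bayesian_linear_draw(X_obs, y_obs, X_mis, rng, ridge=1e-5):
    """Draw imputations for a Gaussian linear model, propagating
    uncertainty in both the coefficients and the observation noise."""
    n, q = X_obs.shape
    S = X_obs.T @ X_obs + ridge * np.eye(q)        # regularized cross-product
    V = np.linalg.inv(S)
    beta_hat = V @ X_obs.T @ y_obs                 # least-squares estimate
    resid = y_obs - X_obs @ beta_hat
    # (1a) draw the noise level sigma_dot from its posterior
    sigma_dot = np.sqrt(resid @ resid / rng.chisquare(n - q))
    # (1b) draw coefficients beta_dot around beta_hat
    beta_dot = beta_hat + sigma_dot * np.linalg.cholesky(V) @ rng.standard_normal(q)
    # (2) draw predictions from N(X_mis @ beta_dot, sigma_dot)
    return X_mis @ beta_dot + sigma_dot * rng.standard_normal(X_mis.shape[0])
```

(Here `rng` is a numpy Generator, e.g. `np.random.default_rng()`.)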
When one uses a random forest to make predictions, for example, the output is a single value: the average of the outputs of B trees, each fitted to a bootstrap sample and each using a different subset of predictors. If the trees are fitted to new bootstrap samples at each iteration, the predictions they make should account for 'uncertainty in model parameters', as different samples would lead to different trees. However, as the random forest ultimately returns a single value, it does not provide an estimate of how uncertain the missing value may be. A naive solution could be to draw a random sample of K trees from all fitted trees, and use the average value of these trees as the prediction. But I do not know if this would account for uncertainty in a 'proper' way: when K=1 there is probably too much uncertainty, because individual trees can strongly overfit, and when K=B there is too little uncertainty. It looks like some work has been done on this in the literature (e.g. https://arxiv.org/abs/1404.6473), but from a very quick look I didn't spot anything that is straightforward to implement.
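As an illustration of the naive K-tree idea, something like this scikit-learn sketch is what I have in mind (the function is hypothetical; K would be the parameter to tune):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def k_tree_prediction(forest: RandomForestRegressor, X_mis, k, rng):
    """Average the predictions of k randomly chosen trees from a fitted forest.

    k=1 uses a single (possibly badly overfit) tree, while
    k=len(forest.estimators_) reproduces the usual forest average
    and removes the extra variability.
    """
    trees = rng.choice(forest.estimators_, size=k, replace=False)
    return np.mean([tree.predict(X_mis) for tree in trees], axis=0)
```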
So overall, when the imputation model is a machine learning model that returns a single 'best' prediction, rather than an interval estimate of where the predicted value may lie, the imputed values that are generated may vary too little, even if predictive mean matching is used subsequently. I do not know at the moment how big a consequence this may have for downstream analyses. I wonder if you have considered this potential issue?
Thanks, Andres
Thank you for this issue, it is important to talk about. I have given this quite a bit of thought, although I haven't written anything on it. Short answer: yes, miceforest with default parameters will probably tend to impute values with less variance than the Gaussian model method specified in Stef van Buuren's book.

When first designing miceforest (actually, the miceRanger R package came first), I planned on copying the missForest R package method, just updating the underlying model to make the whole thing much faster. missForest doesn't use mean matching. I discovered that this led to undesirable results, mainly that the imputations were too predictable and not noisy enough, and that an advanced model could pick up on imputed values, such as floats imputed in an integer field.
So I decided to implement mean matching. This worked very well. I performed tests on data with missing values purposely amputed so I could thoroughly investigate the results. Imputation values were accurate, MAR data was imputed with a distribution noticeably different from the nonmissing distribution, and most importantly of all, the variance of imputed values looked appropriate, given the information density in the dataset. Results looked good, so I decided to keep this formula.
Forget about model uncertainty for a second. Although it is intuitive to think in terms of model errors, the role that the lightgbm model plays in predictive mean matching is not to provide us with uncertainty. The role of the model is to provide us with a neighbor metric: the prediction. We aren't drawing from a probability distribution with well-defined uncertainty, we are finding nearest neighbors and imputing with the real value. Essentially, we are boiling an N-dimensional dataset down into a one-dimensional summary of each column.
This nearest neighbor metric, the predictions from the model, can be thought of as a noisy lookup. We can add noise to the lookup by underfitting, in which case we are saying "impute with values from samples which behave sort of like this one". We can remove noise from the lookup by overfitting, in which case we are saying "impute with values from samples which have features most similar to this sample". This gives us a range of "uncertainty" which is tunable by the user:
- Completely underfit model, outputs random predictions = Imputing values randomly from nonmissing values. This results in maximum uncertainty.
- Completely overfit model, outputs high variance predictions = Imputing values similar to nearest neighbors, based on feature similarity. With `mean matching candidates = 1`, this results in minimal uncertainty.
Model parameter regularization can be intermixed with the `mean matching candidates` parameter, to add even more uncertainty. If `mmc` is equal to the number of nonmissing samples, we are randomly imputing. If `mmc = 1`, we are imputing with the closest neighbor, utilizing the lightgbm model fully.
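To make that concrete, here is a rough numpy sketch of the mean matching step as I think of it (this is not miceforest's actual implementation, and the names are made up); it shows how `mmc` slides between the two extremes above:

```python
import numpy as np

def mean_match(pred_mis, pred_cand, y_cand, mmc, rng):
    """For each missing row, find the mmc candidates whose predictions are
    closest, draw one at random, and impute with its real observed value.
    mmc=1 -> closest neighbor (least variance); mmc=len(y_cand) -> a random
    draw from the nonmissing values (most variance)."""
    imputed = np.empty(len(pred_mis))
    for i, p in enumerate(pred_mis):
        nearest = np.argsort(np.abs(pred_cand - p))[:mmc]  # mmc closest candidates
        imputed[i] = y_cand[rng.choice(nearest)]            # impute a real value
    return imputed
```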
So, the statistically robust answer is in there somewhere. I would be very interested in a study which analyzes exactly where miceforest falls in that spectrum using different parameter sets. I don't really have an interest in writing a white paper or quantifying this; my attention is currently devoted to other projects. I'll leave this issue open for discussion.
I came across this study that suggested there was both bias in values and narrowness in uncertainty in miceRanger. It would be useful if you were able to add this to your benchmarking. https://studenttheses.uu.nl/bitstream/handle/20.500.12932/42449/Elviss_Dvinskis_2459302_ADS_Thesis.pdf?sequence=1&isAllowed=y