remove.lindep() can lead to inconsistent predictorMatrix printing in the mids object
There seems to be a reporting inconsistency: the predictorMatrix after imputation indicates that a variable is included as a predictor, while remove.lindep() has effectively removed it. This leads to biased results due to an uncongenial imputation model, but may lull the user into thinking that the model is correctly specified. See below, where imp1 is wrong and imp2 is correct, yet both print the same predictorMatrix.
library(mice, warn.conflicts = FALSE)
library(dplyr) # data manipulation
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(MASS) # for mvrnorm()
#>
#> Attaching package: 'MASS'
#> The following object is masked from 'package:dplyr':
#>
#> select
set.seed(123)
data <- mvrnorm(2000, mu = c(1, 2), Sigma = matrix(c(4, 4.8, 4.8, 9), 2, 2))
x <- data[, 1]
y <- data[, 2]
data <- data.frame(
x = x,
y = y,
z = y
)
# true parameters
lm(y ~ x + z, data = data)
#>
#> Call:
#> lm(formula = y ~ x + z, data = data)
#>
#> Coefficients:
#> (Intercept) x z
#> 2.302e-15 3.804e-15 1.000e+00
# make data incomplete: MAR based on z (a copy of y), so effectively MNAR
incomplete.data <- ampute(data,
prop = 0.7,
mech = "MAR",
type = "RIGHT",
patterns = c(1, 0, 1),
weights = c(0, 0, 1))$amp
# default imputation: mice removes the collinear predictor
imp <- mice(incomplete.data,
ls.meth = "qr",
meth = "norm",
print = FALSE)
#> Warning: Number of logged events: 1
imp$pred
#> x y z
#> x 0 0 1
#> y 0 0 0
#> z 1 0 0
with(imp, lm(y ~ x + z)) %>% pool %>% summary
#> term estimate std.error statistic df p.value
#> 1 (Intercept) -1.766515e-16 1.005331e-16 -1.757149e+00 562.9535 0.07943602
#> 2 x -1.401097e-15 8.113521e-17 -1.726867e+01 562.9535 0.00000000
#> 3 z 1.000000e+00 5.678926e-17 1.760896e+16 562.9535 0.00000000
# keep all collinear columns
imp1 <- mice(incomplete.data,
remove.collinear = FALSE,
threshold = 1.1,
ls.meth = "qr",
meth = "norm",
print = FALSE)
imp1$pred
#> x y z
#> x 0 1 1
#> y 1 0 1
#> z 1 1 0
with(imp1, lm(y ~ x + z)) %>% pool %>% summary
#> term estimate std.error statistic df p.value
#> 1 (Intercept) -0.08736438 0.11818690 -0.7392053 5.495134 4.901374e-01
#> 2 x 0.76055988 0.04452898 17.0801106 17.205011 3.159029e-12
#> 3 z 0.28608590 0.02302393 12.4255868 226.274361 0.000000e+00
# bypass remove.lindep() with eps = 0
imp2 <- mice(incomplete.data,
remove.collinear = FALSE,
threshold = 1.1,
ls.meth = "qr",
meth = "norm",
print = FALSE,
eps = 0)
imp2$pred
#> x y z
#> x 0 1 1
#> y 1 0 1
#> z 1 1 0
with(imp2, lm(y ~ x + z)) %>% pool %>% summary
#> term estimate std.error statistic df p.value
#> 1 (Intercept) 1.408743e-15 8.607085e-16 1.636725e+00 4.091456 0.175436545
#> 2 x 2.853085e-15 4.895656e-16 5.827789e+00 4.322090 0.003384635
#> 3 z 1.000000e+00 3.420807e-16 2.923286e+15 4.267820 0.000000000
Created on 2021-01-21 by the reprex package (v0.3.0)
Thanks for pointing out this problem.
As far as I can see, there is an unwanted interaction between the regularisation techniques you've implemented in mice.impute.norm() and the separate internal function remove.lindep(). The latter "wins" because it runs before mice.impute.norm(), so mice.impute.norm() never gets to see the problematic data.
I once created remove.lindep() as a hack because mice() frequently produced error messages like Lapack routine dgesv: system is exactly singular, Error in solve.default(xtx + diag(pen)) and similar. The code of remove.lindep() is the worst part of mice (slow and naive), yet it is run for every variable at every iteration. If you have many variables, remove.lindep() may easily take 90% of the processor time used by mice().
So let's make a plan to retire remove.lindep().
I did an experiment by setting eps = 0 as the default and rebuilt the package. This introduces a few new error messages, but they all seem solvable. We also need to introduce additional tests for less frequently used imputation methods, and investigate how they behave under duplicate or highly correlated variables.
I'm not sure where this particular issue is on the roadmap, but I am working on some code to do high-dimensional mass imputation using a reimplementation of mice.impute.cart() built on the ranger package instead of rpart. That function zips, but I've done some profiling, and remove.lindep() is just a huge computational drain that's basically irrelevant to those kinds of models. Passing in eps = 0 helps a lot, but even with that, the following x <- x[keep, , drop = FALSE] is nontrivial with large datasets.
I realize my scenario might be an edge case, and I could hack around this with a fork or by replacing sampler.univ() with my own version at runtime. But having come across this issue, I figured I'd put in a plug for ditching remove.lindep() and ask whether there are plans to scrap it in an upcoming release.
Yes, retiring remove.lindep() is still on the agenda, but as yet there is no generic remover() function. The remover should be able to:
- work both at initialisation and during the iterations
- distinguish between constants and linearly dependent/duplicate variables
- produce proper edits of affected model settings
- have a simple and consistent user API
- and above all, be fast.
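One possible shape for the core of such a function (a rough sketch under my own assumptions; hypothetical API, not mice code, and the real remover would also have to edit the predictorMatrix and method settings): a single pivoted QR decomposition can classify columns as constant or linearly dependent in one fast pass.

```r
# Rough sketch of a generic remover() (hypothetical function, not mice code):
# classify columns of a numeric data set as "ok", "constant", or "dependent"
# with one pivoted QR decomposition, instead of per-variable checks.
remover <- function(x, tol = 1e-8) {
  x <- as.matrix(x)
  status <- setNames(rep("ok", ncol(x)), colnames(x))
  constant <- apply(x, 2, var) < tol
  status[constant] <- "constant"
  # QR of the non-constant columns plus an intercept, so that shifted or
  # scaled copies of another column are also detected as dependent
  idx <- which(!constant)
  qrx <- qr(cbind(1, x[, idx, drop = FALSE]))
  indep <- qrx$pivot[seq_len(qrx$rank)]             # columns spanning the range
  dep <- setdiff(seq_along(idx) + 1L, c(1L, indep)) # skip the intercept column
  status[idx[dep - 1L]] <- "dependent"
  status
}

set.seed(42)
d <- data.frame(x1 = rnorm(50), x3 = rnorm(50))
d$x2 <- 2 * d$x1 + 3   # affine duplicate of x1
d$c  <- 5              # constant column
remover(d[, c("x1", "x2", "x3", "c")])
```

A single decomposition per iteration would replace the per-variable, per-iteration correlation checks, which is where the speed requirement comes in.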
The implementation will need to be well integrated into various parts of mice, and should eliminate the need for esoteric and ad-hoc flags like remove.collinear, eps, threshold, ridge and the like.
Adding a remover is separate from upgrading mice.impute.cart() with ranger for speed. I would welcome a pull request along the same lines as the upgrade to ranger in mice.impute.rf(). I would like to see evidence that ranger has the same statistical properties as cart, and does not cut corners in the method.
I'll have to look at how the ranger option in mice.impute.rf() is implemented. If ranger has already been added to mice.impute.rf(), the functionality might already exist, and it could just be a matter of supplying the right arguments to ranger to make it fit a single tree with behavior similar to rpart.
ranger's algorithm isn't exactly the same as rpart's for a single tree (e.g. there is no cp parameter), and so far in my own testing I've found it to be a little more variable than mice.impute.cart() but much faster. I'm still testing, but once I'm satisfied that everything works as intended, I'd be happy to submit a pull request.