
remove.lindep() can lead to inconsistent predictorMatrix printing in the mids object

gerkovink opened this issue 3 years ago • 5 comments

There seems to be a reporting inconsistency: the predictorMatrix stored in the mids object after imputation indicates that a variable is included as a predictor, while remove.lindep() has in fact removed it. This leads to biased results due to an uncongenial imputation model, yet may lull the user into thinking that the model is correctly specified. See below, where imp1 is wrong and imp2 is correct, yet both print the same predictorMatrix.

library(mice, warn.conflicts = FALSE)
library(dplyr) # data manipulation
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(MASS) # for mvrnorm()
#> 
#> Attaching package: 'MASS'
#> The following object is masked from 'package:dplyr':
#> 
#>     select
set.seed(123)
data <- mvrnorm(2000, mu = c(1, 2), Sigma = matrix(c(4, 4.8, 4.8, 9), 2, 2))
x <- data[, 1]
y <- data[, 2]
data <- data.frame(
  x = x,
  y = y,
  z = y 
)

# true parameters
lm(y ~ x + z, data = data)
#> 
#> Call:
#> lm(formula = y ~ x + z, data = data)
#> 
#> Coefficients:
#> (Intercept)            x            z  
#>   2.302e-15    3.804e-15    1.000e+00

# make data incomplete: missingness in y depends on z; since z is a copy of y, this is effectively MNAR
incomplete.data <- ampute(data, 
                          prop = 0.7, 
                          mech = "MAR", 
                          type = "RIGHT", 
                          patterns = c(1, 0, 1), 
                          weights = c(0, 0, 1))$amp

# imputation with default handling of the collinear column
imp <- mice(incomplete.data, 
            ls.meth = "qr", 
            meth = "norm", 
            print = FALSE)
#> Warning: Number of logged events: 1
imp$pred
#>   x y z
#> x 0 0 1
#> y 0 0 0
#> z 1 0 0
with(imp, lm(y ~ x + z)) %>% pool %>% summary
#>          term      estimate    std.error     statistic       df    p.value
#> 1 (Intercept) -1.766515e-16 1.005331e-16 -1.757149e+00 562.9535 0.07943602
#> 2           x -1.401097e-15 8.113521e-17 -1.726867e+01 562.9535 0.00000000
#> 3           z  1.000000e+00 5.678926e-17  1.760896e+16 562.9535 0.00000000

# keep all collinear columns
imp1 <- mice(incomplete.data, 
            remove.collinear = FALSE, 
            threshold = 1.1, 
            ls.meth = "qr", 
            meth = "norm", 
            print = FALSE)
imp1$pred
#>   x y z
#> x 0 1 1
#> y 1 0 1
#> z 1 1 0
with(imp1, lm(y ~ x + z)) %>% pool %>% summary
#>          term    estimate  std.error  statistic         df      p.value
#> 1 (Intercept) -0.08736438 0.11818690 -0.7392053   5.495134 4.901374e-01
#> 2           x  0.76055988 0.04452898 17.0801106  17.205011 3.159029e-12
#> 3           z  0.28608590 0.02302393 12.4255868 226.274361 0.000000e+00

# bypass remove.lindep() by setting eps = 0
imp2 <- mice(incomplete.data, 
            remove.collinear = FALSE, 
            threshold = 1.1, 
            ls.meth = "qr", 
            meth = "norm", 
            print = FALSE, 
            eps = 0)
imp2$pred
#>   x y z
#> x 0 1 1
#> y 1 0 1
#> z 1 1 0
with(imp2, lm(y ~ x + z)) %>% pool %>% summary
#>          term     estimate    std.error    statistic       df     p.value
#> 1 (Intercept) 1.408743e-15 8.607085e-16 1.636725e+00 4.091456 0.175436545
#> 2           x 2.853085e-15 4.895656e-16 5.827789e+00 4.322090 0.003384635
#> 3           z 1.000000e+00 3.420807e-16 2.923286e+15 4.267820 0.000000000

Created on 2021-01-21 by the reprex package (v0.3.0)
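Note that the only trace of the removal sits in the loggedEvents slot of the mids object, not in the printed predictorMatrix. A quick check for the first run (output omitted; see ?mids for the column layout):

# the "Number of logged events: 1" warning above points here
imp$loggedEvents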

gerkovink avatar Jan 21 '21 09:01 gerkovink

Thanks for pointing out this problem.

As far as I can see, there is an unwanted interaction between the regularisation techniques you've implemented in mice.impute.norm() and the separate internal function remove.lindep(). The latter "wins" because it runs before mice.impute.norm(), so mice.impute.norm() never gets to see the problematic data.

I once created remove.lindep() as a hack because mice() frequently produced error messages like Lapack routine dgesv: system is exactly singular, Error in solve.default(xtx + diag(pen)) and similar. The code of remove.lindep() is the worst part of mice (slow and naive), yet it runs for every variable at every iteration. If you have many variables, remove.lindep() may easily take 90% of the processor time used by mice().
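For anyone who wants to verify that share on their own data, a rough base-R profiling sketch (the exact figure will differ by dataset and mice version; incomplete.data is the data from the reprex above):

Rprof("mice-prof.out")                              # start the base-R sampling profiler
imp <- mice(incomplete.data, meth = "norm", print = FALSE)
Rprof(NULL)                                         # stop profiling
head(summaryRprof("mice-prof.out")$by.total, 10)    # look for remove.lindep in the listing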

So let's make a plan to retire remove.lindep().

I did an experiment by setting eps = 0 as the default and rebuilding the package. This introduces a few new error messages, but they all seem solvable. We also need to add tests for less frequently used imputation methods and investigate how they behave in the presence of duplicate or highly correlated variables.
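A sketch of the kind of test I have in mind, testthat-style, using the nhanes data that ships with mice; everything here is illustrative:

library(testthat)
test_that("pmm survives an exact duplicate predictor", {
  dup <- cbind(nhanes, bmi2 = nhanes$bmi)   # bmi2 is an exact copy of bmi
  # mice may warn about the collinear column; that is expected here
  imp <- suppressWarnings(
    mice(dup, method = "pmm", m = 2, maxit = 2, print = FALSE, seed = 1)
  )
  expect_s3_class(imp, "mids")
  expect_false(anyNA(complete(imp)))        # all missing cells imputed
})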

stefvanbuuren avatar Apr 01 '21 05:04 stefvanbuuren

I'm not sure where this particular issue is on the roadmap, but I am working on some code to do high-dimensional mass imputation using a reimplementation of mice.impute.cart() built on the ranger package instead of rpart. That function zips, but I've done some profiling and remove.lindep() is just a huge computational drain that's basically irrelevant to those kinds of models. Passing in eps = 0 helps a lot, but even with that, the following x <- x[keep, , drop = FALSE] is nontrivial with large datasets.

I realize my scenario might be an edge case, and I can hack around this with a fork or maybe replacing sampler.univ() with my own version at runtime. But having come across this issue, I figured I'd put in a plug for ditching remove.lindep() and see if there were any plans to scrap it in any upcoming releases.
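To make the ranger idea concrete, here is a rough and untested sketch of the kind of drop-in I have in mind. The function name, the min.node.size default and the donor-sampling details are purely illustrative; it follows the mice.impute.* calling convention, assumes a numeric target, and mimics cart by sampling a donor from the matching terminal node:

# illustrative only: single-ranger-tree analogue of mice.impute.cart()
mice.impute.cart.ranger <- function(y, ry, x, wy = NULL, min.node.size = 5, ...) {
  if (is.null(wy)) wy <- !ry
  xobs <- data.frame(x[ry, , drop = FALSE], .y = y[ry])
  xmis <- data.frame(x[wy, , drop = FALSE])
  # one tree, no bootstrap, all predictors considered at every split
  fit <- ranger::ranger(.y ~ ., data = xobs,
                        num.trees = 1, mtry = ncol(x),
                        replace = FALSE, sample.fraction = 1,
                        min.node.size = min.node.size)
  # terminal node ids for donors (observed rows) and recipients (missing rows)
  node.obs <- predict(fit, data = xobs, type = "terminalNodes")$predictions[, 1]
  node.mis <- predict(fit, data = xmis, type = "terminalNodes")$predictions[, 1]
  donors <- split(y[ry], node.obs)
  # draw one donor at random from the matching leaf
  draw1 <- function(v) v[sample.int(length(v), 1L)]
  vapply(as.character(node.mis), function(nd) draw1(donors[[nd]]), numeric(1),
         USE.NAMES = FALSE)
}

Because mice dispatches user-defined methods by function name, something named like this could be tried with method = "cart.ranger" without touching mice itself.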

awmercer avatar Nov 11 '22 23:11 awmercer

Yes, retiring remove.lindep() is still on the agenda, but as yet there is no generic remover() function. The remover should be able to

  • work both at initialisation and during the iterations
  • distinguish between constants and linearly dependent/duplicate variables
  • produce proper edits of affected model settings
  • have a simple and consistent user API
  • and above all, be fast.

The implementation will need to be well integrated into various parts of mice, and should eliminate the need for esoteric, ad-hoc flags such as remove.collinear, eps, threshold, ridge and the like.
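Purely as a strawman of the user-facing side, and to make the requirements above concrete (nothing like this exists yet, all names are placeholders, and it only catches constants and exact duplicates):

remover <- function(data, predictorMatrix) {
  # placeholder sketch: detect constants and exact duplicate columns once, up front
  is.constant  <- vapply(data, function(v) length(unique(v[!is.na(v)])) <= 1L, logical(1))
  is.duplicate <- duplicated(as.list(data))   # exact copies only; true linear
                                              # dependence would need more work
  out <- names(data)[is.constant | is.duplicate]
  # edit the model settings so that what is printed matches what is used
  predictorMatrix[, out] <- 0L
  list(predictorMatrix = predictorMatrix, removed = out)
}

Run once at initialisation, and again between iterations if needed, such a function would replace the per-variable, per-iteration screening that remove.lindep() does now.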

Adding a remover is separate from upgrading mice.impute.cart() with ranger for speed. I would welcome a pull request along the same lines as the ranger upgrade in mice.impute.rf(). I would like to see evidence that ranger has the same statistical properties as cart, and does not cut corners in the method.

stefvanbuuren avatar Nov 13 '22 11:11 stefvanbuuren

I'll have to look at how the ranger option in mice.impute.rf() is implemented. If ranger has already been wired into mice.impute.rf(), much of the functionality might already exist, and it could just be a matter of supplying the right arguments to make ranger fit a single tree that behaves like rpart.

ranger's algorithm isn't exactly the same as rpart's for a single tree (there is no cp parameter, for example), and so far in my own testing I've found it to be a little more variable than mice.impute.cart() but much faster. I'm still testing, but once I'm satisfied that everything works as intended, I'd be happy to submit a pull request.

awmercer avatar Nov 13 '22 16:11 awmercer