How should mice behave when variables are not specified in the model
`test-blocks.R` contains a specification of the mice setup with two non-standard features:

- a duplicate `bmi` is acceptable through the `blocks` specification
- variable `hyp` is not specified
The current policy is not very satisfying. Currently, `where[, "hyp"]` is set to `FALSE`, so `hyp` is not imputed. However, it is still a predictor for blocks `B1`, `bmi` and `age`, thus leading to missing data propagation.
Using c2da03c:
```r
library(mice) # branch support_blocks
#>
#> Attaching package: 'mice'
#> The following object is masked from 'package:stats':
#>
#>     filter
#> The following objects are masked from 'package:base':
#>
#>     cbind, rbind
imp <- mice(nhanes, blocks = make.blocks(list(c("bmi", "chl"), "bmi", "age")), m = 1, print = FALSE)
head(complete(imp))
#>   age  bmi hyp chl
#> 1   1   NA  NA  NA
#> 2   2 22.7   1 187
#> 3   1 27.2   1 187
#> 4   3   NA  NA  NA
#> 5   1 20.4   1 113
#> 6   3   NA  NA 184
imp$blocks
#> $B1
#> [1] "bmi" "chl"
#>
#> $bmi
#> [1] "bmi"
#>
#> $age
#> [1] "age"
#>
#> attr(,"calltype")
#>        B1       bmi       age
#> "formula" "formula" "formula"
imp$formulas
#> $B1
#> bmi + chl ~ age + hyp
#> <environment: 0x11e6e1750>
#>
#> $bmi
#> bmi ~ age + hyp + chl
#> <environment: 0x11e6e1750>
#>
#> $age
#> age ~ bmi + hyp + chl
#> <environment: 0x11e6e1750>
head(imp$where)
#>     age   bmi   hyp   chl
#> 1 FALSE  TRUE FALSE  TRUE
#> 2 FALSE FALSE FALSE FALSE
#> 3 FALSE  TRUE FALSE FALSE
#> 4 FALSE  TRUE FALSE  TRUE
#> 5 FALSE FALSE FALSE FALSE
#> 6 FALSE  TRUE FALSE FALSE
imp$method
#>    B1   bmi   age
#> "pmm" "pmm"    ""
imp$predictorMatrix
#>     age bmi hyp chl
#> age   0   0   0   0
#> bmi   1   0   1   1
#> hyp   1   1   0   1
#> chl   1   1   1   0
```
Created on 2023-09-13 with reprex v2.0.2
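
The propagation can be checked on the six rows printed by `head(complete(imp))` above. The sketch below copies those rows by hand rather than rerunning `mice`, so it only illustrates the pattern in this output: every row in which `bmi` or `chl` is still missing is a row in which `hyp` is missing.

```r
# First six rows of complete(imp), copied from the reprex output above.
completed <- data.frame(
  age = c(1, 2, 1, 3, 1, 3),
  bmi = c(NA, 22.7, 27.2, NA, 20.4, NA),
  hyp = c(NA, 1, 1, NA, 1, NA),
  chl = c(NA, 187, 187, NA, 113, 184)
)

# Rows where an imputation target is still missing after imputation.
still_missing <- is.na(completed$bmi) | is.na(completed$chl)

# Rows where the unimputed predictor hyp is missing.
hyp_missing <- is.na(completed$hyp)

# TRUE if every row with a remaining hole also has a missing hyp.
all(hyp_missing[still_missing])
```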
A better policy might be to inactivate any unmentioned variable `j` as follows:

- set `method[j]` to `""` (we can always do that since `j` is not mentioned in the model)
- set `predictorMatrix[, j]` to `0` (take `j` out as a predictor)
- leave `predictorMatrix[j, ]` untouched (so we can still see which variables it would require to be imputed)
- leave `where[, j]` untouched

As a result, `j` is not imputed and is not a predictor anywhere. The policy might encourage starting small (with a few variables) and gradually building up. Does this seem a good approach? Any downsides to it?