Revisions to prediction with lm_lin()

Open mollyow opened this issue 1 year ago • 1 comments

Currently, predict() does not work for lm_lin() models with multi-valued or factor-valued treatments. The cause is how predict.lm_robust() generates the Lin estimator model matrix for new data. The treatment name saved in the lm_lin() model object is the original variable name, but that variable may have been expanded into multiple columns in the model matrix, so the treatment x covariate interactions are not built correctly. As a result, either the original variable name does not exist as a column in the new model matrix, or the new-data model matrix does not have the right dimensions to be multiplied by the coefficients.

See here.

For example:

library(estimatr)
set.seed(60637)

N <- 40
dat <- data.frame(
  x = rnorm(N, mean = 2.3),
  x2 = rpois(N, lambda = 2),
  x3 = runif(N)
)

dat$y0 <- rnorm(N) + dat$x
dat$y1 <- dat$y0 + 0.35
dat$y2 <- dat$y0 + 0.55

dat$z_multi <- sample(0:2, size = nrow(dat), replace = TRUE)
dat$z_bin <- 1 * (dat$z_multi > 0)
dat$y <- (dat$z_multi == 0) * dat$y0 + (dat$z_multi == 1) * dat$y1 + (dat$z_multi == 2) * dat$y2

# Multi-valued numeric treatments with lm_lin; estimation works as expected
lmlin_mult <- lm_lin(y ~ z_multi, covariates = ~ x, data = dat)
# prediction does not
predict(lmlin_mult, newdata = dat)
# Error in X[, !beta_na, drop = FALSE] :
#   (subscript) logical subscript too long

# Binary factor-valued treatment with lm_lin; estimation works,
lmlin_bin_f <- lm_lin(y ~ as.factor(z_bin), covariates = ~ x + x2 + x3, data = dat)
# prediction breaks
predict(lmlin_bin_f, newdata = dat)
# Error in X[, treat_name] : subscript out of bounds

More detail in the gist here.
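
The naming mismatch behind the second error can be seen with base R alone, without estimatr. This is only an illustrative sketch: the `toy` data frame is made up, and it shows how `model.matrix()` names and counts columns for a factor treatment, not how estimatr builds its matrices internally.

```r
# Base R only; `toy` is a made-up data frame for illustration.
toy <- data.frame(
  z_bin   = rep(0:1, times = 3),
  z_multi = rep(0:2, times = 2),
  x       = c(0.2, 1.1, -0.4, 0.9, 1.7, -1.3)
)

# A factor-wrapped binary treatment gets a column named term + level,
# so neither the term label nor the raw variable name is a column name.
X_bin <- model.matrix(~ as.factor(z_bin) + x, data = toy)
colnames(X_bin)
# "(Intercept)" "as.factor(z_bin)1" "x"
"as.factor(z_bin)" %in% colnames(X_bin)  # FALSE
"z_bin" %in% colnames(X_bin)             # FALSE

# A three-level factor treatment expands into two indicator columns,
# so a single stored treatment name cannot index "the" treatment column.
X_multi <- model.matrix(~ as.factor(z_multi) + x, data = toy)
colnames(X_multi)
# "(Intercept)" "as.factor(z_multi)1" "as.factor(z_multi)2" "x"
```

Indexing a matrix like `X_bin` by the stored name, as in `X_bin[, "as.factor(z_bin)"]`, fails with a "subscript out of bounds" error, which matches the message reported above.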

A revision to handle setting up treatment columns in the new data could be implemented in get_X().
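
One way such a revision might look, sketched in base R under stated assumptions: `lin_matrix()` is a hypothetical helper, not an estimatr function, and it is not a patch to `get_X()`. It assumes the fitted object would store the treatment levels and the training covariate means, and it rebuilds the treatment indicators for new data from those stored levels so the column count always matches the coefficients.

```r
# Hypothetical helper (not part of estimatr): builds a Lin-style matrix
# [intercept, treatment indicators, centered covariates, indicator x covariate].
lin_matrix <- function(z, covs, z_levels, cov_means) {
  # Fixing the levels keeps the indicator columns stable even when
  # the new data contain only some of the treatment values.
  z_f  <- factor(z, levels = z_levels)
  Tmat <- model.matrix(~ z_f)[, -1, drop = FALSE]
  # Center covariates on the training means, not the new-data means.
  Cmat <- sweep(as.matrix(covs), 2, cov_means)
  # Multiply each treatment indicator into every centered covariate.
  Imat <- do.call(cbind, lapply(seq_len(ncol(Tmat)), function(j) Tmat[, j] * Cmat))
  cbind(1, Tmat, Cmat, Imat)
}

set.seed(60637)
train <- data.frame(z = rep(0:2, length.out = 30), x = rnorm(30))
train$y <- train$x + 0.35 * (train$z == 1) + 0.55 * (train$z == 2) + rnorm(30)

# Quantities assumed to be saved at fit time
z_levels  <- sort(unique(train$z))
cov_means <- colMeans(train[, "x", drop = FALSE])

X_train <- lin_matrix(train$z, train[, "x", drop = FALSE], z_levels, cov_means)
beta    <- qr.coef(qr(X_train), train$y)  # least-squares coefficients

# New data with only two of the three treatment values
new   <- data.frame(z = c(2, 0), x = c(1.5, -0.2))
X_new <- lin_matrix(new$z, new[, "x", drop = FALSE], z_levels, cov_means)
ncol(X_new) == length(beta)  # dimensions agree, so X_new %*% beta is defined
pred <- drop(X_new %*% beta)
```

The sketch ignores details a real fix would need to handle, such as treatment values in the new data that were not seen at fit time (which become NA here and cause rows to be dropped) and coefficients that were dropped for collinearity.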

Jan 07 '25 00:01 mollyow

Also thank you all for making such a very useful package!

Jan 07 '25 02:01 mollyow