estimatr produces results even when the design matrix is bad

Open vincentarelbundock opened this issue 4 years ago • 5 comments

I expected this regression to produce NA for the coefficient associated with x:

library(estimatr)
dat <- data.frame(y = rnorm(10000), x = 1)
lm_robust(y ~ x, dat)
#>                 Estimate  Std. Error    t value  Pr(>|t|)      CI Lower
#> (Intercept) -60995863852 63595979733 -0.9591151 0.3375240 -185656785195
#> x            60995863852 63523422972  0.9602106 0.3369725  -63522831636
#>                 CI Upper   DF
#> (Intercept)  63665057491 9998
#> x           185514559340 9998

Created on 2020-06-05 by the reprex package (v0.3.0)
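
For comparison, here is a minimal sketch of the same data fit with base R's lm(), which is the behavior the report expects. The seed is an addition for reproducibility and is not part of the original reprex.

```r
# Same setup as the reprex above; set.seed() is added here only so the
# draw is reproducible.
set.seed(1)
dat <- data.frame(y = rnorm(10000), x = 1)

# Base R's lm() detects that x duplicates the intercept column and
# drops it, so the x coefficient is reported as NA.
fit <- lm(y ~ x, dat)
coef(fit)

# The fitted rank counts linearly independent columns of the design matrix.
fit$rank
```

Here x is identical to the intercept column, so only one of the two columns can be estimated.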

Thoughts?

vincentarelbundock, Jun 06 '20 01:06

Thanks for the issue. When we first wrote the lm solver to use the QR decompositions available in Eigen, we noticed that they differ from the QR decomposition available in base R when there is collinearity between covariates. In most cases, this simply causes slight discrepancies in which coefficients are dropped, but in other cases, like this one, the estimator has some undesirable properties.

Unfortunately, there's not really a great way for us to catch this without adding an additional decomposition, and I don't think we'll do that.

I think the only thing we can do is check the condition number of the QR decomposition and warn if it is too large (statsmodels for Python does something like this), but in general I'm against too many warnings. Thoughts?
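
A minimal sketch of what such a check could look like in plain R. Everything here is an assumption for illustration: check_design_condition() is a hypothetical helper, not an estimatr function, the 1e10 cutoff is arbitrary, and the condition number is taken from the singular values of the design matrix rather than from estimatr's own QR.

```r
# Hypothetical helper, NOT part of estimatr: warn when the design matrix
# is ill-conditioned. The default cutoff of 1e10 is an arbitrary choice.
check_design_condition <- function(formula, data, max_kappa = 1e10) {
  X <- model.matrix(formula, data)

  # 2-norm condition number: largest over smallest singular value.
  # An exactly zero singular value gives Inf, which also triggers the warning.
  d <- svd(X, nu = 0, nv = 0)$d
  k <- max(d) / min(d)

  if (!is.finite(k) || k > max_kappa) {
    warning(
      "design matrix is ill-conditioned (condition number ",
      format(k, digits = 3), "); estimates may be unreliable"
    )
    return(invisible(FALSE))
  }
  invisible(TRUE)
}

# Usage: the collinear design from this issue should trigger the warning.
dat <- data.frame(y = rnorm(10000), x = 1)
check_design_condition(y ~ x, dat)
```

The extra SVD is the kind of additional decomposition mentioned above, so a real implementation would more likely reuse the diagonal of the R factor that the solver already computes.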

lukesonnet, Jun 09 '20 22:06

It "does the right thing" for N=100, e.g.:

lm_robust(y~x, data.frame(y = rnorm(100), x = 1))

You might consider hooking up the threshold to an option() or fit parameter, and/or setting the default as a function of N rather than using the vendor default. See also https://eigen.tuxfamily.org/dox/classEigen_1_1ColPivHouseholderQR.html#ae712cdc9f0e521cfc8061bee58ff55ee
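
A rough sketch of how the threshold drives the rank decision, written in plain R rather than estimatr's C++. It assumes the rule described in the linked Eigen documentation, where a pivot counts as nonzero when its magnitude exceeds the threshold times the largest pivot. pivoted_rank() is a hypothetical helper, and 1e-7 is simply borrowed from the default tolerance of base R's lm().

```r
# Hypothetical helper, NOT part of estimatr or Eigen: count the pivots of
# a column-pivoted QR that exceed threshold * (largest pivot).
pivoted_rank <- function(X, threshold) {
  # qr(..., LAPACK = TRUE) uses a column-pivoted QR; the absolute diagonal
  # of R holds the pivot magnitudes.
  pivots <- abs(diag(qr.R(qr(X, LAPACK = TRUE))))
  sum(pivots > threshold * max(pivots))
}

set.seed(1)
X_bad <- cbind(1, rep(1, 10000))  # x duplicates the intercept
X_ok  <- cbind(1, rnorm(10000))   # x carries independent information

# A conservative threshold treats the rounding-error pivot as zero.
pivoted_rank(X_bad, threshold = 1e-7)
pivoted_rank(X_ok,  threshold = 1e-7)

# A threshold near machine epsilon leaves far less margin, so the result
# for X_bad can depend on rounding error that grows with N.
pivoted_rank(X_bad, threshold = .Machine$double.eps * ncol(X_bad))
```

With the conservative setting the second column of X_bad is treated as redundant, while the last call is deliberately left without an expected result because it depends on floating-point noise.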

nfultz, Jun 09 '20 23:06

The second option seems possible, Neal. I don't think passing through an option makes sense. If people notice they have garbage, they should address the issue rather than messing with tolerances.

lukesonnet, Jun 10 '20 00:06

At least in the context of DeclareDesign, Bad Things can happen when the NAness of the output coefficients changes from draw to draw or as N increases. Maybe it could just have a more conservative setting instead of the vendor default. N of 10k is definitely reasonable in DD, although 1M is probably pushing it.

nfultz, Jun 10 '20 00:06

I hear you, will inspect.

lukesonnet, Jun 11 '20 15:06