estimatr produces results even when the design matrix is bad

I expected this regression to produce NA for the coefficient associated with x:

library(estimatr)
dat <- data.frame(y = rnorm(10000), x = 1)
lm_robust(y ~ x, dat)
#>                 Estimate  Std. Error    t value  Pr(>|t|)      CI Lower
#> (Intercept) -60995863852 63595979733 -0.9591151 0.3375240 -185656785195
#> x            60995863852 63523422972  0.9602106 0.3369725  -63522831636
#>                 CI Upper   DF
#> (Intercept)  63665057491 9998
#> x           185514559340 9998

Created on 2020-06-05 by the reprex package (v0.3.0)

Thoughts?

Jun 06 '20 01:06 vincentarelbundock

Thanks for the issue. When we first wrote the lm solver to use the QR decompositions available in Eigen, we noticed that they differ from the QR decomposition available in base R when there is collinearity between covariates. In most cases, this simply causes slight discrepancies in which coefficients are dropped, but in other cases, like this one, the estimator has some undesirable properties.
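For comparison, here is a minimal sketch of what base R does with the same design. It only uses `lm()`, `model.matrix()` and `qr()` from base R; the seed is an assumption added for reproducibility and is not part of the original reprex.

```r
# Base R comparison for the reprex above (seed added here for reproducibility).
set.seed(1)
dat <- data.frame(y = rnorm(10000), x = 1)

# lm() uses a pivoted QR with a rank tolerance and reports NA for
# coefficients it cannot identify.
fit <- lm(y ~ x, data = dat)
coef(fit)    # the coefficient on x is NA
fit$rank     # numerical rank found by lm()

# The same rank is available directly from the design matrix.
X <- model.matrix(y ~ x, data = dat)
qr(X)$rank
```

Here `x` is identical to the intercept column, so base R treats the design as rank 1 and drops `x`, which is the behaviour the issue expected from `lm_robust()`.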

Unfortunately, there isn't a great way for us to catch this without adding an additional decomposition, and I don't think we'll do that.

I think the only thing we can do is check the condition number of the QR decomposition and warn if it is too large (statsmodels for Python does something like this), but in general I'm against too many warnings. Thoughts?
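A rough sketch of what such a check could look like in plain R, outside of estimatr. The helper name `warn_if_ill_conditioned()` and the cutoff `max_cond = 1e8` are hypothetical choices for illustration, and the condition number is computed from the singular values of the design matrix (which are the same as those of its R factor), not from estimatr's internals.

```r
# Hypothetical helper, not part of estimatr: warn when the design matrix
# is numerically close to rank-deficient.
warn_if_ill_conditioned <- function(X, max_cond = 1e8) {
  s <- svd(X, nu = 0, nv = 0)$d   # singular values, largest first
  cond <- max(s) / min(s)         # 2-norm condition number; Inf if min(s) is 0
  if (cond > max_cond) {
    warning("Design matrix is ill-conditioned (condition number ",
            format(cond, digits = 3), "); estimates may be unreliable.",
            call. = FALSE)
  }
  invisible(cond)
}

set.seed(1)
bad  <- model.matrix(y ~ x, data.frame(y = rnorm(10000), x = 1))
good <- model.matrix(y ~ x, data.frame(y = rnorm(10000), x = rnorm(10000)))

warn_if_ill_conditioned(bad)                 # emits the warning
cond_good <- warn_if_ill_conditioned(good)   # silent
```

The cost is one extra SVD of the design matrix, which is the "additional decomposition" trade-off mentioned above.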

Jun 09 '20 22:06 lukesonnet

It "does the right thing" for N=100, e.g.:

lm_robust(y~x, data.frame(y = rnorm(100), x = 1))

You might consider hooking up the threshold to an option() or fit parameter, and/or setting the default as a function of N rather than using the vendor default. See also https://eigen.tuxfamily.org/dox/classEigen_1_1ColPivHouseholderQR.html#ae712cdc9f0e521cfc8061bee58ff55ee
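To illustrate the "threshold as a function of N" idea, here is a sketch in base R rather than Eigen. It counts the pivots of a column-pivoted QR that exceed a relative tolerance, similar in spirit to the relative-pivot rule in the linked Eigen documentation. The helper `rank_with_tol()` and the default `max(dim(X)) * .Machine$double.eps` are assumptions chosen for illustration, not what estimatr or Eigen actually use.

```r
# Hypothetical helper: numerical rank from a column-pivoted QR, with a
# relative tolerance that grows with the size of the design matrix.
rank_with_tol <- function(X, tol = max(dim(X)) * .Machine$double.eps) {
  R <- qr.R(qr(X, LAPACK = TRUE))   # LAPACK QR with column pivoting
  piv <- abs(diag(R))               # pivot magnitudes
  sum(piv > tol * max(piv))         # count pivots that are not negligible
}

set.seed(1)
n <- 10000
bad  <- model.matrix(y ~ x, data.frame(y = rnorm(n), x = 1))
good <- model.matrix(y ~ x, data.frame(y = rnorm(n), x = rnorm(n)))

rank_with_tol(bad)    # x duplicates the intercept column
rank_with_tol(good)   # both columns are identified
```

Because the tolerance scales with N, rounding error that accumulates over more rows is less likely to be mistaken for a real pivot, so the set of dropped coefficients should be more stable as N grows.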

Jun 09 '20 23:06 nfultz

The second option seems possible, Neal. I don't think passing through an option makes sense. If people notice they have garbage, they should address the issue rather than messing with tolerances.

Jun 10 '20 00:06 lukesonnet

At least in the context of DeclareDesign, Bad Things can happen when the NAness of the output coefficients changes from draw to draw or as N increases. Maybe it could just have a more conservative setting instead of the vendor default. N of 10k is definitely reasonable in DD, although 1M is probably pushing it.

Jun 10 '20 00:06 nfultz

I hear you, will inspect.

Jun 11 '20 15:06 lukesonnet