estimatr
estimatr produces results even when the design matrix is bad
I expected this regression to produce NA for the coefficient associated with x:
library(estimatr)
dat <- data.frame(y = rnorm(10000), x = 1)
lm_robust(y ~ x, dat)
#>                 Estimate  Std. Error    t value  Pr(>|t|)      CI Lower
#> (Intercept) -60995863852 63595979733 -0.9591151 0.3375240 -185656785195
#> x            60995863852 63523422972  0.9602106 0.3369725  -63522831636
#>                 CI Upper   DF
#> (Intercept)  63665057491 9998
#> x           185514559340 9998
Created on 2020-06-05 by the reprex package (v0.3.0)
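For comparison, here is a minimal sketch of the behaviour I had in mind, using base R's lm() on the same kind of data. Base lm() detects that x duplicates the intercept and reports NA for its coefficient. The seed is only there to make the sketch reproducible and is not part of the original report.

```r
# Same setup as above, but fit with base R's lm() for comparison.
set.seed(1)  # illustrative seed, not part of the original report
dat <- data.frame(y = rnorm(10000), x = 1)

fit <- lm(y ~ x, dat)

# x is perfectly collinear with the intercept, so lm() drops it
# and reports its coefficient as NA.
coef(fit)
is.na(coef(fit)[["x"]])
```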
Thoughts?
Thanks for the issue. When we first wrote the lm solver to use the QR decompositions available in Eigen, we noticed some differences from the QR decomposition in base R when covariates are collinear. In most cases, this simply causes slight discrepancies in which coefficients are dropped, but in other cases, like this one, the estimator has some undesirable properties.
Unfortunately, there's not really a great way for us to catch this without adding an additional decomposition, and I don't think we'll do that.
I think the only thing we can do is check the condition number of the QR decomposition and warn if it is too large (statsmodels for Python does something like this), but in general I'm against too many warnings. Thoughts?
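To make the condition-number idea concrete, here is a rough sketch in plain R, outside of estimatr. It computes the 2-norm condition number of the design matrix from its singular values and warns when it exceeds a cutoff. The helper name warn_if_ill_conditioned and the cutoff 1e10 are assumptions for illustration only. This is not how estimatr or statsmodels implement anything, and it works on the design matrix directly rather than on the Eigen QR factors.

```r
# Hypothetical helper: warn when the design matrix is badly conditioned.
# `max_kappa` is an arbitrary illustrative cutoff, not a value used by estimatr.
warn_if_ill_conditioned <- function(formula, data, max_kappa = 1e10) {
  X <- model.matrix(formula, data)
  # Singular values only; their max/min ratio is the 2-norm condition number.
  s <- svd(X, nu = 0, nv = 0)$d
  k <- max(s) / min(s)  # Inf if the smallest singular value is exactly 0
  if (!is.finite(k) || k > max_kappa) {
    warning(
      "Design matrix is ill-conditioned (condition number ",
      format(k, digits = 3),
      "); coefficient estimates may be unreliable."
    )
  }
  invisible(k)
}

# The constant covariate from the report duplicates the intercept,
# so this call should trigger the warning.
dat <- data.frame(y = rnorm(100), x = 1)
k <- suppressWarnings(warn_if_ill_conditioned(y ~ x, dat))
```

The obvious cost is an extra SVD per fit, which is the kind of additional decomposition mentioned above.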
It "does the right thing" for N=100, e.g.:
lm_robust(y~x, data.frame(y = rnorm(100), x = 1))
You might consider hooking the threshold up to an option() or fit parameter, and/or setting the default as a function of N rather than using the vendor default. See also https://eigen.tuxfamily.org/dox/classEigen_1_1ColPivHouseholderQR.html#ae712cdc9f0e521cfc8061bee58ff55ee
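A rough sketch of the "default as a function of N" idea, written in base R rather than in estimatr's C++ code. It runs a column-pivoted QR via qr(X, LAPACK = TRUE) and counts the diagonal entries of R that exceed a relative threshold. The helper name rank_with_threshold and the scaling .Machine$double.eps * nrow(X) are assumptions for illustration, not Eigen's or estimatr's actual rule.

```r
# Hypothetical helper: numerical rank from a column-pivoted QR, with a
# relative threshold that grows with the number of rows (an assumed rule).
rank_with_threshold <- function(X, threshold = .Machine$double.eps * nrow(X)) {
  # With LAPACK = TRUE, base R pivots columns but does not estimate the rank,
  # so we inspect the diagonal of R ourselves.
  decomp <- qr(X, LAPACK = TRUE)
  d <- abs(diag(qr.R(decomp)))
  # A pivot counts as nonzero if it is large relative to the biggest pivot.
  sum(d > threshold * max(d))
}

X <- cbind(1, rep(1, 10000))  # intercept plus a constant covariate
rank_with_threshold(X)
```

With a threshold that scales with N, the rounding error left in the second pivot should fall below the cutoff, so the constant column would be treated as redundant rather than estimated.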
The second option seems possible, Neal. I don't think passing through an option makes sense. If people notice they have garbage, they should address the issue rather than messing with tolerances.
At least in the context of DeclareDesign, Bad Things can happen when the NAness of the output coefficients changes from draw to draw or as N increases. Maybe it could just have a more conservative setting instead of the vendor default. N of 10k is definitely reasonable in DD, although 1M is probably pushing it.
I hear you, will inspect.