estimatr
estimatr copied to clipboard
add warning for collinearity
After running a panel model (5475 units, 13 years) with stats::lm(), I've been trying to estimate heteroskedasticity robust standard errors with estimatr::lm_robust(). It turns out that some coefficients estimated are (very) different in each function - which shouldn't happen (only se, t.test and p-values should change).
df <- readRDS(url("https://github.com/victorrssx/victorrssx/raw/main/df.RDS")) %>%
dplyr::mutate(ano = as.factor(ano))
lm_mod <- stats::lm(tx_hom_tot ~ pos2018 + res_ou:pos2018 + ti:pos2018 + res_ou:ti:pos2018 + ano, data = df)
summary(lm_mod)
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 26.827 0.382 70.25 < 0.0000000000000002 ***
pos2018 2.927 0.538 5.44 0.00000005226909 ***
ano2011 0.220 0.540 0.41 0.68385
ano2012 1.182 0.536 2.20 0.02754 *
ano2013 0.892 0.537 1.66 0.09640 .
ano2014 1.960 0.534 3.67 0.00025 ***
ano2015 2.985 0.533 5.60 0.00000002131005 ***
ano2016 5.556 0.530 10.48 < 0.0000000000000002 ***
ano2017 6.835 0.528 12.95 < 0.0000000000000002 ***
ano2018 4.841 0.531 9.12 < 0.0000000000000002 ***
ano2019 -1.633 0.520 -3.14 0.00169 **
ano2020 0.280 0.518 0.54 0.58944
ano2021 -0.272 0.517 -0.53 0.59925
ano2022 NA NA NA NA
pos2018:res_ou 0.862 0.643 1.34 0.18005
pos2018:ti 2.874 0.629 4.57 0.00000498294524 ***
pos2018:res_ou:ti 9.824 1.365 7.20 0.00000000000063 ***
lm_mod <- estimatr::lm_robust(tx_hom_tot ~ pos2018 + res_ou:pos2018 + ti:pos2018 + res_ou:ti:pos2018 + ano, data = df, se_type = "HC1")
summary(lm_mod)
Standard error type: HC1
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 26.930 10407.875 0.00258749 0.9979355
pos2018 2842202543499.796 324275566161917504.000 0.00000876 0.9999930
ano2011 0.222 196.496 0.00112810 0.9990999
ano2012 1.179 330.843 0.00356436 0.9971561
ano2013 0.890 397.685 0.00223694 0.9982152
ano2014 1.957 125.512 0.01558929 0.9875621
ano2015 2.982 396.852 0.00751471 0.9940042
ano2016 5.553 470.073 0.01181333 0.9905746
ano2017 6.832 355.406 0.01922433 0.9846622
ano2018 4.838 355.006 0.01362761 0.9891271
ano2019 -2842202543498.518 264769891052461504.000 -0.00001073 0.9999914
ano2020 -2842202543497.091 374441170828414656.000 -0.00000759 0.9999939
ano2021 -2842202543497.189 324275566161908160.000 -0.00000876 0.9999930
ano2022 -2842202543496.979 324275566161925760.000 -0.00000876 0.9999930
pos2018:res_ou 0.863 5.743 0.15017708 0.8806255
pos2018:ti 2.874 0.677 4.24755884 0.0000216
pos2018:res_ou:ti 9.852 2739.434 0.00359654 0.9971304
Specifically, the problem happens sharply on pos2018, ano2019, ano2020, ano2021 which are all dummies. The first is set to 1 for years after 2018 and 0 otherwise; the remainder are year-specific dummies. Even setting fixed_effects = ~ ano produce strange results.
lm_mod <- estimatr::lm_robust(tx_hom_tot ~ pos2018 + res_ou:pos2018 + ti:pos2018 + res_ou:ti:pos2018, data = df, se_type = "HC1", fixed_effects = ~ ano)
summary(lm_mod)
Standard error type: HC1
Coefficients: (1 not defined because the design matrix is rank deficient)
Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper DF
pos2018 NA NA NA NA NA NA NA
pos2018:res_ou 0.862 0.707 1.22 0.22251985 -0.523 2.25 50693
pos2018:ti 2.874 0.640 4.49 0.00000709 1.620 4.13 50693
pos2018:res_ou:ti 9.824 2.252 4.36 0.00001292 5.409 14.24 50693
When using lmtest::coeftest(lm_mod, vcov. = plm::vcovHC(lm_mod, type = "HC1")) it seems to produce correct results. So why is this happening with lm_robust? Could it be due to the interpretation of the NAs in the data by the function?
Notice that lm is dropping 2022, probably due to colinearity in your data. I would try removing it manually from the model and seeing how lm_robust does.
yes -- pos2018 is collinear with ano and ano is collinear with the constant you cannot estimate a coefficient for pos2018 when you include ano fixed effects the different coefficients reflect different ways of addressing collinearity
the fixed effect results you get look good; note pos2018 is still (correctly) dropped
note also that the interaction terms are all the same and these do not depend on this collinearity!
confirm when you drop pos2018
lm_mod1 <- stats::lm(tx_hom_tot ~ res_ou:pos2018 + ti:pos2018 + res_ou:ti:pos2018 + ano+ , data = df)> > > lm_mod2 <- estimatr::lm_robust(tx_hom_tot ~ res_ou:pos2018 + ti:pos2018 + res_ou:ti:pos2018 + ano+ , data = df,+ se_type = "HC1")> cbind(coef(lm_mod1) , coef(lm_mod2)) [,1] [,2] (Intercept) 26.8272932 26.8272932 ano2011 0.2199035 0.2199035 ano2012 1.1821015 1.1821015 ano2013 0.8924856 0.8924856 ano2014 1.9595268 1.9595268 ano2015 2.9851035 2.9851035 ano2016 5.5560158 5.5560158 ano2017 6.8353214 6.8353214 ano2018 4.8407606 4.8407606 ano2019 1.2935977 1.2935977 ano2020 3.2067074 3.2067074 ano2021 2.6553411 2.6553411 ano2022 2.9270375 2.9270375 res_ou:pos2018 0.8624844 0.8624844 pos2018:ti 2.8738130 2.8738130 res_ou:pos2018:ti 9.8239281 9.8239281
On Tue, Jun 25, 2024 at 6:00 PM Neal Fultz @.***> wrote:
Notice that lm is dropping 2022, probably due to colinearity in your data. I would try removing it manually from the model and seeing how lm_robust does.
— Reply to this email directly, view it on GitHub https://github.com/DeclareDesign/estimatr/issues/411#issuecomment-2189344813, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADBE57PZWXGX4CNSAGNW47LZJGHZVAVCNFSM6AAAAABJ4EWVASVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCOBZGM2DIOBRGM . You are receiving this because you are subscribed to this thread.Message ID: @.***>
@nfultz @macartan If I get it correctly I think you guys are right. Since pos2018 is defined as 1 after 2018 and 0 otherwise it can be seen as a linear combination of year dummies from 2019 to 2022. Removing pos2018 effectively return equal estimated coefficients apart the package used. I tried also with fixest::feols:
stats::lm fixest::feols estimatr::lm_robust
(Intercept) 26.8272932 26.8272932 26.8272932
ano2011 0.2199035 0.2199035 0.2199035
ano2012 1.1821015 1.1821014 1.1821015
ano2013 0.8924856 0.8924856 0.8924856
ano2014 1.9595268 1.9595268 1.9595268
ano2015 2.9851035 2.9851034 2.9851035
ano2016 5.5560158 5.5560158 5.5560158
ano2017 6.8353214 6.8353214 6.8353214
ano2018 4.8407606 4.8407606 4.8407606
ano2019 1.2935977 1.2935976 1.2935977
ano2020 3.2067074 3.2067073 3.2067074
ano2021 2.6553411 2.6553411 2.6553411
ano2022 2.9270375 2.9270374 2.9270375
res_ou:pos2018 0.8624844 -25.9648088 0.8624844
pos2018:ti 2.8738130 2.8738130 2.8738130
res_ou:pos2018:ti 9.8239281 -17.0033651 9.8239281
Note that fixest::feols returns a warning message when problems like these occur.
The variable 'pos2018' has been removed because of collinearity (see $collin.var).
Perhaps a message like this could be a future improvement due to cases where you accidentally construct new variables that are collinear with the original ones (and you don't notice at first glance). Thank you!