
add warning for collinearity

Open · victorrssx opened this issue 1 year ago · 3 comments

After fitting a panel model (5475 units, 13 years) with stats::lm(), I've been trying to estimate heteroskedasticity-robust standard errors with estimatr::lm_robust(). It turns out that some of the estimated coefficients are (very) different between the two functions, which shouldn't happen: only the standard errors, t-statistics, and p-values should change.

library(dplyr)  # for %>% and mutate()

df <- readRDS(url("https://github.com/victorrssx/victorrssx/raw/main/df.RDS")) %>% 
      dplyr::mutate(ano = as.factor(ano))


lm_mod <- stats::lm(tx_hom_tot ~ pos2018 + res_ou:pos2018 + ti:pos2018 + res_ou:ti:pos2018 + ano, data = df)
summary(lm_mod)

    Coefficients: (1 not defined because of singularities)
                      Estimate Std. Error t value             Pr(>|t|)    
    (Intercept)         26.827      0.382   70.25 < 0.0000000000000002 ***
    pos2018              2.927      0.538    5.44     0.00000005226909 ***
    ano2011              0.220      0.540    0.41              0.68385    
    ano2012              1.182      0.536    2.20              0.02754 *  
    ano2013              0.892      0.537    1.66              0.09640 .  
    ano2014              1.960      0.534    3.67              0.00025 ***
    ano2015              2.985      0.533    5.60     0.00000002131005 ***
    ano2016              5.556      0.530   10.48 < 0.0000000000000002 ***
    ano2017              6.835      0.528   12.95 < 0.0000000000000002 ***
    ano2018              4.841      0.531    9.12 < 0.0000000000000002 ***
    ano2019             -1.633      0.520   -3.14              0.00169 ** 
    ano2020              0.280      0.518    0.54              0.58944    
    ano2021             -0.272      0.517   -0.53              0.59925    
    ano2022                 NA         NA      NA                   NA    
    pos2018:res_ou       0.862      0.643    1.34              0.18005    
    pos2018:ti           2.874      0.629    4.57     0.00000498294524 ***
    pos2018:res_ou:ti    9.824      1.365    7.20     0.00000000000063 ***
            
lm_mod <- estimatr::lm_robust(tx_hom_tot ~ pos2018 + res_ou:pos2018 + ti:pos2018 + res_ou:ti:pos2018 + ano, data = df, se_type = "HC1")
summary(lm_mod)

Standard error type:  HC1 

Coefficients:
                            Estimate             Std. Error     t value  Pr(>|t|)
(Intercept)                   26.930              10407.875  0.00258749 0.9979355
pos2018            2842202543499.796 324275566161917504.000  0.00000876 0.9999930
ano2011                        0.222                196.496  0.00112810 0.9990999
ano2012                        1.179                330.843  0.00356436 0.9971561
ano2013                        0.890                397.685  0.00223694 0.9982152
ano2014                        1.957                125.512  0.01558929 0.9875621
ano2015                        2.982                396.852  0.00751471 0.9940042
ano2016                        5.553                470.073  0.01181333 0.9905746
ano2017                        6.832                355.406  0.01922433 0.9846622
ano2018                        4.838                355.006  0.01362761 0.9891271
ano2019           -2842202543498.518 264769891052461504.000 -0.00001073 0.9999914
ano2020           -2842202543497.091 374441170828414656.000 -0.00000759 0.9999939
ano2021           -2842202543497.189 324275566161908160.000 -0.00000876 0.9999930
ano2022           -2842202543496.979 324275566161925760.000 -0.00000876 0.9999930
pos2018:res_ou                 0.863                  5.743  0.15017708 0.8806255
pos2018:ti                     2.874                  0.677  4.24755884 0.0000216
pos2018:res_ou:ti              9.852               2739.434  0.00359654 0.9971304

Specifically, the problem shows up sharply in pos2018, ano2019, ano2020, and ano2021, which are all dummies. The first is set to 1 for years after 2018 and 0 otherwise; the rest are year-specific dummies. Even setting fixed_effects = ~ ano produces strange results.

lm_mod <- estimatr::lm_robust(tx_hom_tot ~ pos2018 + res_ou:pos2018 + ti:pos2018 + res_ou:ti:pos2018, data = df, se_type = "HC1", fixed_effects = ~ ano)
summary(lm_mod) 

Standard error type:  HC1 

Coefficients: (1 not defined because the design matrix is rank deficient)
                  Estimate Std. Error t value   Pr(>|t|) CI Lower CI Upper    DF
pos2018                 NA         NA      NA         NA       NA       NA    NA
pos2018:res_ou       0.862      0.707    1.22 0.22251985   -0.523     2.25 50693
pos2018:ti           2.874      0.640    4.49 0.00000709    1.620     4.13 50693
pos2018:res_ou:ti    9.824      2.252    4.36 0.00001292    5.409    14.24 50693

Using lmtest::coeftest(lm_mod, vcov. = plm::vcovHC(lm_mod, type = "HC1")) seems to produce correct results. So why is this happening with lm_robust? Could it be due to how the function handles the NAs in the data?
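
For reference, a minimal sketch of that robust-SE check against the stats::lm fit above; since plm::vcovHC expects a plm (panel) fit, sandwich::vcovHC is shown here as the analogous call for an lm object:

library(lmtest)
library(sandwich)

# Same specification as above; lm itself drops the aliased ano2022 term
lm_fit <- stats::lm(tx_hom_tot ~ pos2018 + res_ou:pos2018 + ti:pos2018 + res_ou:ti:pos2018 + ano, data = df)

# HC1 ("Stata-style") robust covariance matrix applied to the lm coefficients
lmtest::coeftest(lm_fit, vcov. = sandwich::vcovHC(lm_fit, type = "HC1"))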

victorrssx · Jun 25 '24 15:06

Notice that lm is dropping 2022, probably due to collinearity in your data. I would try removing it manually from the model and seeing how lm_robust does.

nfultz · Jun 25 '24 16:06

Yes -- pos2018 is collinear with ano, and ano is collinear with the constant. You cannot estimate a coefficient for pos2018 when you include ano fixed effects; the different coefficients reflect different ways of addressing the collinearity.

The fixed-effects results you get look good; note that pos2018 is still (correctly) dropped.

Note also that the interaction terms are all the same, and these do not depend on this collinearity!
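
You can see the dependence directly in the design matrix (a minimal sketch, assuming pos2018 is the 0/1 indicator for years after 2018, as described above):

# pos2018 equals the sum of the post-2018 year dummies row by row,
# so a design matrix containing both cannot have full column rank
X <- model.matrix(~ pos2018 + ano, data = df)
max(abs(X[, "pos2018"] - rowSums(X[, c("ano2019", "ano2020", "ano2021", "ano2022")])))
# should be exactly 0, i.e. an exact linear dependence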

To confirm, here is what you get when you drop pos2018:

lm_mod1 <- stats::lm(tx_hom_tot ~ res_ou:pos2018 + ti:pos2018 + res_ou:ti:pos2018 + ano, data = df)
lm_mod2 <- estimatr::lm_robust(tx_hom_tot ~ res_ou:pos2018 + ti:pos2018 + res_ou:ti:pos2018 + ano, data = df, se_type = "HC1")
cbind(coef(lm_mod1), coef(lm_mod2))

                          [,1]       [,2]
    (Intercept)       26.8272932 26.8272932
    ano2011            0.2199035  0.2199035
    ano2012            1.1821015  1.1821015
    ano2013            0.8924856  0.8924856
    ano2014            1.9595268  1.9595268
    ano2015            2.9851035  2.9851035
    ano2016            5.5560158  5.5560158
    ano2017            6.8353214  6.8353214
    ano2018            4.8407606  4.8407606
    ano2019            1.2935977  1.2935977
    ano2020            3.2067074  3.2067074
    ano2021            2.6553411  2.6553411
    ano2022            2.9270375  2.9270375
    res_ou:pos2018     0.8624844  0.8624844
    pos2018:ti         2.8738130  2.8738130
    res_ou:pos2018:ti  9.8239281  9.8239281

macartan · Jun 25 '24 16:06

@nfultz @macartan If I understand correctly, I think you are both right. Since pos2018 is defined as 1 after 2018 and 0 otherwise, it can be seen as a linear combination of the year dummies from 2019 to 2022. Removing pos2018 effectively returns equal estimated coefficients regardless of the package used. I also tried fixest::feols:

                   stats::lm fixest::feols estimatr::lm_robust
(Intercept)       26.8272932    26.8272932          26.8272932
ano2011            0.2199035     0.2199035           0.2199035
ano2012            1.1821015     1.1821014           1.1821015
ano2013            0.8924856     0.8924856           0.8924856
ano2014            1.9595268     1.9595268           1.9595268
ano2015            2.9851035     2.9851034           2.9851035
ano2016            5.5560158     5.5560158           5.5560158
ano2017            6.8353214     6.8353214           6.8353214
ano2018            4.8407606     4.8407606           4.8407606
ano2019            1.2935977     1.2935976           1.2935977
ano2020            3.2067074     3.2067073           3.2067074
ano2021            2.6553411     2.6553411           2.6553411
ano2022            2.9270375     2.9270374           2.9270375
res_ou:pos2018     0.8624844   -25.9648088           0.8624844
pos2018:ti         2.8738130     2.8738130           2.8738130
res_ou:pos2018:ti  9.8239281   -17.0033651           9.8239281 

Note that fixest::feols returns a warning message when problems like these occur.

The variable 'pos2018' has been removed because of collinearity (see $collin.var).
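
A minimal sketch of what that looks like in practice, assuming the same formula and data as above (the object name fe_mod is just illustrative):

library(fixest)

# Refit with pos2018 included: feols detects the collinearity, drops the
# variable with the warning above, and records its name in the fitted object
fe_mod <- fixest::feols(tx_hom_tot ~ pos2018 + res_ou:pos2018 + ti:pos2018 + res_ou:ti:pos2018 + ano, data = df)
fe_mod$collin.var
# "pos2018"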

Perhaps a message like this could be a future improvement, for cases where you accidentally construct new variables that are collinear with the original ones (and don't notice at first glance). Thank you!
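
In the meantime, a minimal sketch of a manual check that catches this, assuming the same formula and data:

# Rank check on the design matrix: a rank smaller than the number of columns
# means at least one regressor is an exact linear combination of the others
X <- model.matrix(~ pos2018 + res_ou:pos2018 + ti:pos2018 + res_ou:ti:pos2018 + ano, data = df)
qr(X)$rank == ncol(X)   # FALSE here, signalling collinearity

# stats::alias() names the aliased term(s) in an lm fit (ano2022 in the summary above)
stats::alias(stats::lm(tx_hom_tot ~ pos2018 + res_ou:pos2018 + ti:pos2018 + res_ou:ti:pos2018 + ano, data = df))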

victorrssx · Jun 25 '24 21:06