"std::bad_alloc" error from lm_robust with CR2 errors.

Open aaronrudkin opened this issue 6 years ago • 4 comments

I have run into a reproducible "std::bad_alloc" error when fitting a regression with CR2 standard errors. With other standard errors, or with clustering removed, it works without a problem. I don't know whether this is a numerical stability issue or something else. I had to zip the data file to upload it to GitHub, so please unzip it first.

Minimal reproducible failing example:

library(haven)     # read_dta()
library(estimatr)  # lm_robust()
my_df = read_dta("maindata.dta")
reg = lm_robust(guerrilla ~ cofvalue + as.factor(year) + as.factor(coddist), data = my_df, clusters = as.factor(coddist))

Error:

Error in lm_variance(X = if (se_type %in% c("HC2", "HC3", "CR2") && res) cbind(data[["x"]], : std::bad_alloc

Given the std:: prefix, my assumption is that this originates further down the Rcpp rabbit hole.

I can replicate this on my Windows desktop (16GB RAM, nothing else running, and as best I can tell the R process does not hit RAM limits). It also happens on my MacBook, and another user could replicate it on a loaner Mac and a Windows laptop.

I guess the two issues are:

  1. What's causing this and does it suggest any broader class of regressions that won't run?
  2. If it's a resource constraint, probably a more user-informative error would be useful.

maindata.dta.zip

aaronrudkin avatar Dec 05 '18 19:12 aaronrudkin

Thank you for this report. It was showing up on the Solaris CRAN checks and is a high priority bug. I hope to find time to address it in 2018.

lukesonnet avatar Dec 18 '18 14:12 lukesonnet

@lukesonnet Have you had any luck? Do you need help?

jlsutherland avatar Feb 01 '19 16:02 jlsutherland

This seems to be a case where we simply try to allocate far too much memory, since CR2 is very demanding. I think it is a resource constraint problem, but I'm not sure how to catch it.

I'm definitely open to solutions.
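
One possible stopgap on the user side, offered only as a sketch: wrap the call in `tryCatch()` and translate the bare "std::bad_alloc" message into something more descriptive. The helper name `with_alloc_hint` is hypothetical and not part of estimatr. It can only help when the failure surfaces as an ordinary R error; a hard crash of the R session cannot be caught this way.

```r
# Hypothetical helper (not part of estimatr): evaluate an expression and
# replace a bare "std::bad_alloc" error with a more informative message.
with_alloc_hint <- function(expr) {
  tryCatch(
    expr,  # lazily evaluated here, so errors are raised inside tryCatch()
    error = function(e) {
      if (grepl("std::bad_alloc", conditionMessage(e), fixed = TRUE)) {
        stop(
          "Memory allocation failed (std::bad_alloc). ",
          "CR2 standard errors can be very memory-demanding; ",
          "consider another se_type such as \"CR0\".",
          call. = FALSE
        )
      }
      stop(e)  # re-signal any other error unchanged
    }
  )
}

# Usage (requires estimatr and a data frame with a cluster column):
# fit <- with_alloc_hint(
#   estimatr::lm_robust(y ~ x, data = my_df, clusters = cluster, se_type = "CR2")
# )
```

Taking the whole call as a lazily evaluated argument keeps the unquoted `clusters =` argument exactly as the user wrote it, rather than forwarding it through `...`.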

lukesonnet avatar Feb 01 '19 16:02 lukesonnet

For what it's worth, the problem is particularly acute when you have many clusters. See the minimal working example below. Obviously this affects not just lm_robust() but also commarobust().

library(estimatr)

# Data w/ many clusters
df_bigc <- data.frame(y = rnorm(300000),
                      x = rnorm(300000),
                      cluster = rep(1:1000, 300))

# Data w/ few clusters
df_smallc <- data.frame(y = rnorm(300000),
                        x = rnorm(300000),
                        cluster = rep(1:2, 150000))

# CR0, few clusters - success!
estimatr::lm_robust(formula = y ~ x, 
                    data = df_smallc, 
                    clusters = cluster, 
                    se_type = "CR0")

# CR0, many clusters - success!
estimatr::lm_robust(formula = y ~ x, 
                    data = df_bigc, 
                    clusters = cluster, 
                    se_type = "CR0")

# CR2, few clusters - typically gives error
estimatr::lm_robust(formula = y ~ x, 
                    data = df_smallc, 
                    clusters = cluster, 
                    se_type = "CR2")

# CR2, many clusters - typically crashes
estimatr::lm_robust(formula = y ~ x, 
                    data = df_bigc, 
                    clusters = cluster, 
                    se_type = "CR2")

bgall avatar Jul 24 '19 23:07 bgall