fixest Sparse model matrix, hatvalues, and HC2/HC3/HU standard errors

Sparse model matrix, hatvalues, and HC2/HC3/HU standard errors

Open kylebutts opened this issue 1 year ago • 9 comments

I've added sparse_model_matrix implementation.

Some development notes:

The code started from what we had. I reworked the way the fixed effect matrices were generated to (i) work with time-varying slopes and (ii) work when sparse_model_matrix is passed a formula instead of a fixest object. I used fixef_terms to get the fe_vars and slope_var_list names. Then prepare_df gives me the variables (with ^ interactions). Then, making a sparse matrix with them approximately follows from what you wrote

library(fixest)
sp = sparse_model_matrix(
  wt ~ hp | cyl + cyl^gear + cyl[mpg], 
  mtcars,
  type = "fixef",
  collin.rm = FALSE
)
sp
#> 32 x 14 sparse Matrix of class "dgCMatrix"
#>   [[ suppressing 14 column names 'cyl::4', 'cyl::6', 'cyl::8' ... ]]
#>                                           
#>  [1,] . 1 .  .   21.0  .   . . . . 1 . . .
#>  [2,] . 1 .  .   21.0  .   . . . . 1 . . .
#>  [3,] 1 . . 22.8  .    .   . 1 . . . . . .
#>  [4,] . 1 .  .   21.4  .   . . . 1 . . . .
#>  [5,] . . 1  .    .   18.7 . . . . . . 1 .
#>  [6,] . 1 .  .   18.1  .   . . . 1 . . . .
#>  [7,] . . 1  .    .   14.3 . . . . . . 1 .
#>  [8,] 1 . . 24.4  .    .   . 1 . . . . . .
#>  [9,] 1 . . 22.8  .    .   . 1 . . . . . .
#> [10,] . 1 .  .   19.2  .   . . . . 1 . . .
#> [11,] . 1 .  .   17.8  .   . . . . 1 . . .
#> [12,] . . 1  .    .   16.4 . . . . . . 1 .
#> [13,] . . 1  .    .   17.3 . . . . . . 1 .
#> [14,] . . 1  .    .   15.2 . . . . . . 1 .
#> [15,] . . 1  .    .   10.4 . . . . . . 1 .
#> [16,] . . 1  .    .   10.4 . . . . . . 1 .
#> [17,] . . 1  .    .   14.7 . . . . . . 1 .
#> [18,] 1 . . 32.4  .    .   . 1 . . . . . .
#> [19,] 1 . . 30.4  .    .   . 1 . . . . . .
#> [20,] 1 . . 33.9  .    .   . 1 . . . . . .
#> [21,] 1 . . 21.5  .    .   1 . . . . . . .
#> [22,] . . 1  .    .   15.5 . . . . . . 1 .
#> [23,] . . 1  .    .   15.2 . . . . . . 1 .
#> [24,] . . 1  .    .   13.3 . . . . . . 1 .
#> [25,] . . 1  .    .   19.2 . . . . . . 1 .
#> [26,] 1 . . 27.3  .    .   . 1 . . . . . .
#> [27,] 1 . . 26.0  .    .   . . 1 . . . . .
#> [28,] 1 . . 30.4  .    .   . . 1 . . . . .
#> [29,] . . 1  .    .   15.8 . . . . . . . 1
#> [30,] . 1 .  .   19.7  .   . . . . . 1 . .
#> [31,] . . 1  .    .   15.0 . . . . . . . 1
#> [32,] 1 . . 21.4  .    .   . 1 . . . . . .

colnames(sp)
#>  [1] "cyl::4"        "cyl::6"        "cyl::8"        "cyl[[mpg]]::4"
#>  [5] "cyl[[mpg]]::6" "cyl[[mpg]]::8" "cyl^gear::4_3" "cyl^gear::4_4"
#>  [9] "cyl^gear::4_5" "cyl^gear::6_3" "cyl^gear::6_4" "cyl^gear::6_5"
#> [13] "cyl^gear::8_3" "cyl^gear::8_5"

When no data argument is passed, the original data is used. When na.rm == TRUE, all the rows with NA in the original regression are dropped. If data is passed (even if it's the same data), only the NAs in the particular part of the model.matrix are dropped (e.g. if type = "lhs" is used, then only NAs on the LHS are dropped). Alternatively, I could update the regression object, but I decided against it since that would also cause confusions. For example, "why are rows in y being dropped, I don't have any NAs there"
You can pass a formula directly to sparse_model_matrix without even estimating for type %in% c("lhs", "rhs", "fixest"), but if you want na.rm = TRUE or collin.rm = TRUE, then the model needs to be estimated. This is useful as a "sparse matrix factory function"
I didn't implement subset. I can, but the code seems complicated.
I copied all the tests from model.matrix and added a few more pertaining to fixed effects.
I didn't dive into getting poly to work when data is passed. I could think about trying to have a rough warning that comes when "poly" appears in the formula and data is passed?
There is the option combine which when equal to FALSE, returns a named list with sparse matrices for each element in type.

Examples

library(fixest)
est = feols(mpg ~ drat | cyl, mtcars)

sparse_model_matrix(est, type = "lhs")
#> 32 x 1 sparse Matrix of class "dgCMatrix"
#>        mpg
#>  [1,] 21.0
#>  [2,] 21.0
#>  [3,] 22.8
#>  [4,] 21.4
#>  [5,] 18.7
#>  [6,] 18.1
#>  [7,] 14.3
#>  [8,] 24.4
#>  [9,] 22.8
#> [10,] 19.2
#> [11,] 17.8
#> [12,] 16.4
#> [13,] 17.3
#> [14,] 15.2
#> [15,] 10.4
#> [16,] 10.4
#> [17,] 14.7
#> [18,] 32.4
#> [19,] 30.4
#> [20,] 33.9
#> [21,] 21.5
#> [22,] 15.5
#> [23,] 15.2
#> [24,] 13.3
#> [25,] 19.2
#> [26,] 27.3
#> [27,] 26.0
#> [28,] 30.4
#> [29,] 15.8
#> [30,] 19.7
#> [31,] 15.0
#> [32,] 21.4

sparse_model_matrix(est, type = c("rhs", "fixef"))
#> 32 x 4 sparse Matrix of class "dgCMatrix"
#>       drat cyl::4 cyl::6 cyl::8
#>  [1,] 3.90      .      1      .
#>  [2,] 3.90      .      1      .
#>  [3,] 3.85      1      .      .
#>  [4,] 3.08      .      1      .
#>  [5,] 3.15      .      .      1
#>  [6,] 2.76      .      1      .
#>  [7,] 3.21      .      .      1
#>  [8,] 3.69      1      .      .
#>  [9,] 3.92      1      .      .
#> [10,] 3.92      .      1      .
#> [11,] 3.92      .      1      .
#> [12,] 3.07      .      .      1
#> [13,] 3.07      .      .      1
#> [14,] 3.07      .      .      1
#> [15,] 2.93      .      .      1
#> [16,] 3.00      .      .      1
#> [17,] 3.23      .      .      1
#> [18,] 4.08      1      .      .
#> [19,] 4.93      1      .      .
#> [20,] 4.22      1      .      .
#> [21,] 3.70      1      .      .
#> [22,] 2.76      .      .      1
#> [23,] 3.15      .      .      1
#> [24,] 3.73      .      .      1
#> [25,] 3.08      .      .      1
#> [26,] 4.08      1      .      .
#> [27,] 4.43      1      .      .
#> [28,] 3.77      1      .      .
#> [29,] 4.22      .      .      1
#> [30,] 3.62      .      1      .
#> [31,] 3.54      .      .      1
#> [32,] 4.11      1      .      .

sparse_model_matrix(
  mpg ~ i(vs) | gear^cyl, data = mtcars, 
  type = c("rhs", "fixef"),
  collin.rm = FALSE
)
#> 32 x 10 sparse Matrix of class "dgCMatrix"
#>   [[ suppressing 10 column names 'vs::0', 'vs::1', 'gear^cyl::3_4' ... ]]
#>                          
#>  [1,] 1 . . . . . 1 . . .
#>  [2,] 1 . . . . . 1 . . .
#>  [3,] . 1 . . . 1 . . . .
#>  [4,] . 1 . 1 . . . . . .
#>  [5,] 1 . . . 1 . . . . .
#>  [6,] . 1 . 1 . . . . . .
#>  [7,] 1 . . . 1 . . . . .
#>  [8,] . 1 . . . 1 . . . .
#>  [9,] . 1 . . . 1 . . . .
#> [10,] . 1 . . . . 1 . . .
#> [11,] . 1 . . . . 1 . . .
#> [12,] 1 . . . 1 . . . . .
#> [13,] 1 . . . 1 . . . . .
#> [14,] 1 . . . 1 . . . . .
#> [15,] 1 . . . 1 . . . . .
#> [16,] 1 . . . 1 . . . . .
#> [17,] 1 . . . 1 . . . . .
#> [18,] . 1 . . . 1 . . . .
#> [19,] . 1 . . . 1 . . . .
#> [20,] . 1 . . . 1 . . . .
#> [21,] . 1 1 . . . . . . .
#> [22,] 1 . . . 1 . . . . .
#> [23,] 1 . . . 1 . . . . .
#> [24,] 1 . . . 1 . . . . .
#> [25,] 1 . . . 1 . . . . .
#> [26,] . 1 . . . 1 . . . .
#> [27,] 1 . . . . . . 1 . .
#> [28,] . 1 . . . . . 1 . .
#> [29,] 1 . . . . . . . . 1
#> [30,] 1 . . . . . . . 1 .
#> [31,] 1 . . . . . . . . 1
#> [32,] . 1 . . . 1 . . . .

^{Created on 2023-05-17 with reprex v2.0.2}

May 18 '23 03:05 kylebutts

BTW, probably need to make a few more changes to make it faster, so I'll push and circle back when I think it's in a good state!

May 18 '23 15:05 kylebutts

@lrberge I think we are good to merge!

Some dev notes for HC2/HC3 implementation:

Small-sample adjustments are a bit awkward with HC2/HC3 standard errors (since they are a form of small-sample adjustment). vcov_hetero_internal multiplies by N / N-1 (cluster.adj) which isn't needed for HC2/HC3. This makes vcov(est, "hc2", ssc = scc(adj = FALSE)) not match sandwich::vcovHC(est, "HC2") which is not a good default IMO. To fix this, instead of making the adjustment in vcov_hetero_internal, I instead adjust the scores before getting to that point by multiplying scores by sqrt(n / n - 1) in the scores section of the code. The reason I think this is the best option is in the scores section of vcov.fixest, I have to update the scores for HC2/HC3 anyways, so it's natural to make HC1 adjustments to the score as well at that point (see lines 757-794 in R/VCOV.R).
I don't think this code does anything. The attribute ss_adj is not added by any vcov function

    ss_adj = attr(vcov_noAdj, "ss_adj")
    if(!is.null(ss_adj)){
        if(is.function(ss_adj)){
            ss_adj = ss_adj(n = n, K = K)
        }
        attr(vcov_noAdj, "ss_adj") = NULL
    }

Jun 07 '23 21:06 kylebutts

@kylebutts This looks very interesting. Can you provide some quick benchmarks/comparisons to show practical gains over the current approach? Ofc we'd expect memory allocations to decrease significantly, but I'm curious to see speed implications too. Thanks.

Jun 12 '23 05:06 grantmcdermott

@grantmcdermott I think memory usage will be about the same in most cases. model.matrix.fixest will return just a single vector for fixed effects (so not the matrix of 0/1s). That's the same memory footprint as a sparse matrix of the 0/1 dummies. The only difference is with i() which sparse will save on memory.

The creation step is just slightly slower than Matrix::sparse.model.matrix (tens of milliseconds), but (1) has more features (collin.rm) and (2) allows full use of fixest formula semantics. The slow step is adding colnames to the matrix actually (pasting strings together is surprisingly slow)

The real advantage is not the speed in which you can generate the matrix but the uses afterwards. For example, https://github.com/s3alfisc/summclust/pull/19#issue-1751016605 has a 13x speedup.

Here's a quick example of the speedups of avoiding dense matrices:

library(data.table)
library(fixest)
library(Matrix)


# Test -------------------------------------------------------------------------

N = 10000
var_fes = 10
var_eps = 4

# Ensure P_ii exists
N_i = pmax(1, rpois(N, 3))

# Create dataset
df = data.table(id = rep(1:N, N_i))
df[, j := 1:.N, by = id]
df[, fe_i := rnorm(1, 0, sd = sqrt(var_fes)), by = id]
df[, x := rnorm(.N, mean = fe_i, sd = 1)]
df[, x2 := rnorm(.N, mean = fe_i, sd = 1)]
df[, x3 := rnorm(.N, mean = fe_i, sd = 1)]
df[, x4 := rnorm(.N, mean = fe_i, sd = 1)]
df[, y := x * 1 + fe_i + rnorm(.N, 0, sqrt(var_eps))]

# Estimate FE model

model = feols(y ~ x | id, df)

X = fixest:::sparse_model_matrix(
  y ~ x | id, df, type = c("rhs", "fixef"), 
  combine = TRUE, collin.rm = FALSE
)
X_dense = as.matrix(X)
#> Warning in asMethod(object): sparse->dense coercion: allocating vector of size
#> 2.3 GiB

bench::mark(
  Matrix::crossprod(X, df$y),
  Matrix::crossprod(X_dense, df$y),
  check = FALSE
)
#> # A tibble: 2 × 6
#>   expression                            min  median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                       <bch:tm> <bch:t>     <dbl> <bch:byt>    <dbl>
#> 1 Matrix::crossprod(X, df$y)          285µs   322µs   2620.       493KB     2.20
#> 2 Matrix::crossprod(X_dense, df$y)    488ms   499ms      2.00    89.8KB     0

Jun 12 '23 17:06 kylebutts

Just to chime in here - there are indeed large performance advantages for the cluster jackknife CRV3 estimator as suggested by MNW (2023) if X is sparse. And if I read the MNW paper correctly, I think that many people will be interested in using the CRV3 estimator in the future =) Unfortunately, it does not play too well with (arbitrary) fixed effects, so one has to bite the bullet and create the design matrix of all dummies in most cases.

Here is a quick implementation of a part of the algorithm, which works both for sparse and dense matrices:

library(Matrix)

N <- 100000
k <- 100
y <- rnorm(N)
X <- factor(sample(1:k, N, TRUE))
df <- data.frame(X = X)
mf <- model.frame(~X, X)
X <- sparse.model.matrix(mf)
beta <- rnorm(ncol(X))
y <- X %*% beta + rnorm(N)
cluster <- sample(letters, N, TRUE)

length(unique(cluster)
# 26 clusters


y_dense <- as.matrix(y)
X_dense <- as.matrix(X)

cluster_jackknife <- function(X = X, y = y, cluster = cluster){
  

  XXinv <- solve(crossprod(X))
  Xy <- t(X) %*% y
  
  cluster_groups <- unique(cluster)
  XgXg_list <- lapply(cluster_groups, function(g) solve(crossprod(X[cluster == g,])))
  Xgyg_list <- lapply(cluster_groups, function(g) t(X[cluster == g,]) %*% y[cluster == g])
  
  # compute leave-one-out regression coefs
  beta_jack <- lapply(seq_along(cluster_groups), function(g) MASS::ginv(as.matrix(XXinv - XgXg_list[[g]])) %*% (Xy -Xgyg_list[[g]]))
  beta_jack
}

microbenchmark::microbenchmark(
  cluster_jackknife(X_dense, y_dense, cluster),
  cluster_jackknife(X, y, cluster), 
  times = 1
)

# cluster_jackknife(X_dense, y_dense, cluster)
# cluster_jackknife(X, y, cluster)
# min        lq      mean    median        uq       max neval
# 1794.6250 1794.6250 1794.6250 1794.6250 1794.6250 1794.6250     1
# 385.1208  385.1208  385.1208  385.1208  385.1208  385.1208     1

So in this case, the sparse version is around 4-5x faster. The larger k and the dimension of clusters, the larger the speed gains will be.

Of course you could argue that I could just create the model matrix myself (at the cost of some performance overhead) - but another great advantage is that this PR creates the full model matrix of dummies - which will allow me to support all sort of fixest syntax I currently do not know how to handle both in fwildclusterboot and summclust (e.g. varying slopes).

Jun 12 '23 20:06 s3alfisc

Figured out a faster way to implement this using block formula for projection matrix: https://en.wikipedia.org/wiki/Projection_matrix#Blockwise_formula

cov.iid already has solved for (Xtilde' Xtilde)^{-1} where Xtilde is demeaned by fixed effects and slope vars
Calculating projection matrix of a bunch of fixed effects is really fast
The only pain is slope_vars which I remove and demean by fixed effects using fixest::demean.

Before (using approx to speed it up):

library(data.table)
library(fixest)
library(Matrix)
library(tictoc)
#> 
#> Attaching package: 'tictoc'
#> The following object is masked from 'package:data.table':
#> 
#>     shift

# devtools::load_all()

# Test -------------------------------------------------------------------------

N = 100000
var_fes = 10
var_eps = 4

# Ensure P_ii exists
N_i = pmax(2, rpois(N, 3))

# Create dataset
df = data.table(id = rep(1:N, N_i))
df[, j := 1:.N, by = id]
df[, fe_i := rnorm(1, 0, sd = sqrt(var_fes)), by = id]
df[, x := rnorm(.N, mean = fe_i, sd = 1)]
df[, x2 := rnorm(.N, mean = fe_i, sd = 1)]
df[, x3 := rnorm(.N, mean = fe_i, sd = 1)]
df[, x4 := rnorm(.N, mean = fe_i, sd = 1)]
df[, g := sample(1:25, .N, replace = TRUE)]
df[, y := x * 1 + fe_i + rnorm(.N, 0, sqrt(var_eps))]
df = as.data.frame(df)

# Estimate FE model
model = feols(y ~ x + x2 + x3 + x4 | id + g, df)

tic()
P_ii_orig = hatvalues(model, exact = FALSE, p = 500)
toc()
#> 12.39 sec elapsed

^{Created on 2023-06-17 with reprex v2.0.2}

After (solving exactly):

library(data.table)
library(fixest)
library(Matrix)
library(tictoc)
#> 
#> Attaching package: 'tictoc'
#> The following object is masked from 'package:data.table':
#> 
#>     shift

# devtools::load_all()

# Test -------------------------------------------------------------------------

N = 100000
var_fes = 10
var_eps = 4

# Ensure P_ii exists
N_i = pmax(2, rpois(N, 3))

# Create dataset
df = data.table(id = rep(1:N, N_i))
df[, j := 1:.N, by = id]
df[, fe_i := rnorm(1, 0, sd = sqrt(var_fes)), by = id]
df[, x := rnorm(.N, mean = fe_i, sd = 1)]
df[, x2 := rnorm(.N, mean = fe_i, sd = 1)]
df[, x3 := rnorm(.N, mean = fe_i, sd = 1)]
df[, x4 := rnorm(.N, mean = fe_i, sd = 1)]
df[, g := sample(1:25, .N, replace = TRUE)]
df[, y := x * 1 + fe_i + rnorm(.N, 0, sqrt(var_eps))]
df = as.data.frame(df)

# Estimate FE model
model = feols(y ~ x + x2 + x3 + x4 | id + g, df)

tic()
P_ii_orig = hatvalues(model, exact = TRUE)
toc()
#> 1.1 sec elapsed

^{Created on 2023-06-17 with reprex v2.0.2}

Jun 17 '23 18:06 kylebutts

Very cool PR! Nice work! I'll try to merge it next week.

Feb 07 '24 22:02 lrberge

@lrberge I put a couple things in this one PR. Want me rebase onto main? I'll also adjust the work with data.table commits based on your changes

b992cd7 is for did2s printing of results with a custom VCOV. I thought you had fixed it, but it doesn't seem to work.
Sparse model matrix is super useful to allow use of fixest functions + other statistical packages
- It's been a while now, but I'm wondering if it's possible to make sparse_model_matrix(feols(..., only.env = TRUE)) work
hatvalues.fixest ties directly into HC3/HC4/HU vcov and sandwich functions that need hatvalues

Feb 07 '24 23:02 kylebutts

fixest fixest copied to clipboard

Sparse model matrix, hatvalues, and HC2/HC3/HU standard errors

fixest
fixest copied to clipboard