fixest
fixest copied to clipboard
Sparse model matrix, hatvalues, and HC2/HC3/HU standard errors
I've added sparse_model_matrix implementation.
Some development notes:
- The code started from what we had. I reworked the way the fixed effect matrices were generated to (i) work with time-varying slopes and (ii) work when
sparse_model_matrix
is passed a formula instead of afixest
object. I usedfixef_terms
to get thefe_vars
andslope_var_list
names. Thenprepare_df
gives me the variables (with^
interactions). Then, making a sparse matrix with them approximately follows from what you wrote
library(fixest)
sp = sparse_model_matrix(
wt ~ hp | cyl + cyl^gear + cyl[mpg],
mtcars,
type = "fixef",
collin.rm = FALSE
)
sp
#> 32 x 14 sparse Matrix of class "dgCMatrix"
#> [[ suppressing 14 column names 'cyl::4', 'cyl::6', 'cyl::8' ... ]]
#>
#> [1,] . 1 . . 21.0 . . . . . 1 . . .
#> [2,] . 1 . . 21.0 . . . . . 1 . . .
#> [3,] 1 . . 22.8 . . . 1 . . . . . .
#> [4,] . 1 . . 21.4 . . . . 1 . . . .
#> [5,] . . 1 . . 18.7 . . . . . . 1 .
#> [6,] . 1 . . 18.1 . . . . 1 . . . .
#> [7,] . . 1 . . 14.3 . . . . . . 1 .
#> [8,] 1 . . 24.4 . . . 1 . . . . . .
#> [9,] 1 . . 22.8 . . . 1 . . . . . .
#> [10,] . 1 . . 19.2 . . . . . 1 . . .
#> [11,] . 1 . . 17.8 . . . . . 1 . . .
#> [12,] . . 1 . . 16.4 . . . . . . 1 .
#> [13,] . . 1 . . 17.3 . . . . . . 1 .
#> [14,] . . 1 . . 15.2 . . . . . . 1 .
#> [15,] . . 1 . . 10.4 . . . . . . 1 .
#> [16,] . . 1 . . 10.4 . . . . . . 1 .
#> [17,] . . 1 . . 14.7 . . . . . . 1 .
#> [18,] 1 . . 32.4 . . . 1 . . . . . .
#> [19,] 1 . . 30.4 . . . 1 . . . . . .
#> [20,] 1 . . 33.9 . . . 1 . . . . . .
#> [21,] 1 . . 21.5 . . 1 . . . . . . .
#> [22,] . . 1 . . 15.5 . . . . . . 1 .
#> [23,] . . 1 . . 15.2 . . . . . . 1 .
#> [24,] . . 1 . . 13.3 . . . . . . 1 .
#> [25,] . . 1 . . 19.2 . . . . . . 1 .
#> [26,] 1 . . 27.3 . . . 1 . . . . . .
#> [27,] 1 . . 26.0 . . . . 1 . . . . .
#> [28,] 1 . . 30.4 . . . . 1 . . . . .
#> [29,] . . 1 . . 15.8 . . . . . . . 1
#> [30,] . 1 . . 19.7 . . . . . . 1 . .
#> [31,] . . 1 . . 15.0 . . . . . . . 1
#> [32,] 1 . . 21.4 . . . 1 . . . . . .
colnames(sp)
#> [1] "cyl::4" "cyl::6" "cyl::8" "cyl[[mpg]]::4"
#> [5] "cyl[[mpg]]::6" "cyl[[mpg]]::8" "cyl^gear::4_3" "cyl^gear::4_4"
#> [9] "cyl^gear::4_5" "cyl^gear::6_3" "cyl^gear::6_4" "cyl^gear::6_5"
#> [13] "cyl^gear::8_3" "cyl^gear::8_5"
-
When no data argument is passed, the original data is used. When
na.rm == TRUE
, all the rows with NA in the original regression are dropped. If data is passed (even if it's the same data), only the NAs in the particular part of the model.matrix are dropped (e.g. iftype = "lhs"
is used, then only NAs on the LHS are dropped). Alternatively, I couldupdate
the regression object, but I decided against it since that would also cause confusions. For example, "why are rows in y being dropped, I don't have any NAs there" -
You can pass a formula directly to sparse_model_matrix without even estimating for
type %in% c("lhs", "rhs", "fixest")
, but if you wantna.rm = TRUE
orcollin.rm = TRUE
, then the model needs to be estimated. This is useful as a "sparse matrix factory function" -
I didn't implement
subset
. I can, but the code seems complicated. -
I copied all the tests from
model.matrix
and added a few more pertaining to fixed effects. -
I didn't dive into getting
poly
to work whendata
is passed. I could think about trying to have a rough warning that comes when"poly"
appears in the formula anddata
is passed? -
There is the option
combine
which when equal toFALSE
, returns a named list with sparse matrices for each element intype
.
Examples
library(fixest)
est = feols(mpg ~ drat | cyl, mtcars)
sparse_model_matrix(est, type = "lhs")
#> 32 x 1 sparse Matrix of class "dgCMatrix"
#> mpg
#> [1,] 21.0
#> [2,] 21.0
#> [3,] 22.8
#> [4,] 21.4
#> [5,] 18.7
#> [6,] 18.1
#> [7,] 14.3
#> [8,] 24.4
#> [9,] 22.8
#> [10,] 19.2
#> [11,] 17.8
#> [12,] 16.4
#> [13,] 17.3
#> [14,] 15.2
#> [15,] 10.4
#> [16,] 10.4
#> [17,] 14.7
#> [18,] 32.4
#> [19,] 30.4
#> [20,] 33.9
#> [21,] 21.5
#> [22,] 15.5
#> [23,] 15.2
#> [24,] 13.3
#> [25,] 19.2
#> [26,] 27.3
#> [27,] 26.0
#> [28,] 30.4
#> [29,] 15.8
#> [30,] 19.7
#> [31,] 15.0
#> [32,] 21.4
sparse_model_matrix(est, type = c("rhs", "fixef"))
#> 32 x 4 sparse Matrix of class "dgCMatrix"
#> drat cyl::4 cyl::6 cyl::8
#> [1,] 3.90 . 1 .
#> [2,] 3.90 . 1 .
#> [3,] 3.85 1 . .
#> [4,] 3.08 . 1 .
#> [5,] 3.15 . . 1
#> [6,] 2.76 . 1 .
#> [7,] 3.21 . . 1
#> [8,] 3.69 1 . .
#> [9,] 3.92 1 . .
#> [10,] 3.92 . 1 .
#> [11,] 3.92 . 1 .
#> [12,] 3.07 . . 1
#> [13,] 3.07 . . 1
#> [14,] 3.07 . . 1
#> [15,] 2.93 . . 1
#> [16,] 3.00 . . 1
#> [17,] 3.23 . . 1
#> [18,] 4.08 1 . .
#> [19,] 4.93 1 . .
#> [20,] 4.22 1 . .
#> [21,] 3.70 1 . .
#> [22,] 2.76 . . 1
#> [23,] 3.15 . . 1
#> [24,] 3.73 . . 1
#> [25,] 3.08 . . 1
#> [26,] 4.08 1 . .
#> [27,] 4.43 1 . .
#> [28,] 3.77 1 . .
#> [29,] 4.22 . . 1
#> [30,] 3.62 . 1 .
#> [31,] 3.54 . . 1
#> [32,] 4.11 1 . .
sparse_model_matrix(
mpg ~ i(vs) | gear^cyl, data = mtcars,
type = c("rhs", "fixef"),
collin.rm = FALSE
)
#> 32 x 10 sparse Matrix of class "dgCMatrix"
#> [[ suppressing 10 column names 'vs::0', 'vs::1', 'gear^cyl::3_4' ... ]]
#>
#> [1,] 1 . . . . . 1 . . .
#> [2,] 1 . . . . . 1 . . .
#> [3,] . 1 . . . 1 . . . .
#> [4,] . 1 . 1 . . . . . .
#> [5,] 1 . . . 1 . . . . .
#> [6,] . 1 . 1 . . . . . .
#> [7,] 1 . . . 1 . . . . .
#> [8,] . 1 . . . 1 . . . .
#> [9,] . 1 . . . 1 . . . .
#> [10,] . 1 . . . . 1 . . .
#> [11,] . 1 . . . . 1 . . .
#> [12,] 1 . . . 1 . . . . .
#> [13,] 1 . . . 1 . . . . .
#> [14,] 1 . . . 1 . . . . .
#> [15,] 1 . . . 1 . . . . .
#> [16,] 1 . . . 1 . . . . .
#> [17,] 1 . . . 1 . . . . .
#> [18,] . 1 . . . 1 . . . .
#> [19,] . 1 . . . 1 . . . .
#> [20,] . 1 . . . 1 . . . .
#> [21,] . 1 1 . . . . . . .
#> [22,] 1 . . . 1 . . . . .
#> [23,] 1 . . . 1 . . . . .
#> [24,] 1 . . . 1 . . . . .
#> [25,] 1 . . . 1 . . . . .
#> [26,] . 1 . . . 1 . . . .
#> [27,] 1 . . . . . . 1 . .
#> [28,] . 1 . . . . . 1 . .
#> [29,] 1 . . . . . . . . 1
#> [30,] 1 . . . . . . . 1 .
#> [31,] 1 . . . . . . . . 1
#> [32,] . 1 . . . 1 . . . .
Created on 2023-05-17 with reprex v2.0.2
BTW, probably need to make a few more changes to make it faster, so I'll push and circle back when I think it's in a good state!
@lrberge I think we are good to merge!
Some dev notes for HC2/HC3 implementation:
-
Small-sample adjustments are a bit awkward with HC2/HC3 standard errors (since they are a form of small-sample adjustment).
vcov_hetero_internal
multiplies by N / N-1 (cluster.adj) which isn't needed for HC2/HC3. This makesvcov(est, "hc2", ssc = scc(adj = FALSE))
not matchsandwich::vcovHC(est, "HC2")
which is not a good default IMO. To fix this, instead of making the adjustment invcov_hetero_internal
, I instead adjust the scores before getting to that point by multiplying scores by sqrt(n / n - 1) in the scores section of the code. The reason I think this is the best option is in the scores section ofvcov.fixest
, I have to update the scores for HC2/HC3 anyways, so it's natural to make HC1 adjustments to the score as well at that point (see lines 757-794 inR/VCOV.R
). -
I don't think this code does anything. The attribute
ss_adj
is not added by any vcov function
ss_adj = attr(vcov_noAdj, "ss_adj")
if(!is.null(ss_adj)){
if(is.function(ss_adj)){
ss_adj = ss_adj(n = n, K = K)
}
attr(vcov_noAdj, "ss_adj") = NULL
}
@kylebutts This looks very interesting. Can you provide some quick benchmarks/comparisons to show practical gains over the current approach? Ofc we'd expect memory allocations to decrease significantly, but I'm curious to see speed implications too. Thanks.
@grantmcdermott I think memory usage will be about the same in most cases. model.matrix.fixest
will return just a single vector for fixed effects (so not the matrix of 0/1s). That's the same memory footprint as a sparse matrix of the 0/1 dummies. The only difference is with i()
which sparse will save on memory.
The creation step is just slightly slower than Matrix::sparse.model.matrix
(tens of milliseconds), but (1) has more features (collin.rm
) and (2) allows full use of fixest formula semantics. The slow step is adding colnames to the matrix actually (pasting strings together is surprisingly slow)
The real advantage is not the speed in which you can generate the matrix but the uses afterwards. For example, https://github.com/s3alfisc/summclust/pull/19#issue-1751016605 has a 13x speedup.
Here's a quick example of the speedups of avoiding dense matrices:
library(data.table)
library(fixest)
library(Matrix)
# Test -------------------------------------------------------------------------
N = 10000
var_fes = 10
var_eps = 4
# Ensure P_ii exists
N_i = pmax(1, rpois(N, 3))
# Create dataset
df = data.table(id = rep(1:N, N_i))
df[, j := 1:.N, by = id]
df[, fe_i := rnorm(1, 0, sd = sqrt(var_fes)), by = id]
df[, x := rnorm(.N, mean = fe_i, sd = 1)]
df[, x2 := rnorm(.N, mean = fe_i, sd = 1)]
df[, x3 := rnorm(.N, mean = fe_i, sd = 1)]
df[, x4 := rnorm(.N, mean = fe_i, sd = 1)]
df[, y := x * 1 + fe_i + rnorm(.N, 0, sqrt(var_eps))]
# Estimate FE model
model = feols(y ~ x | id, df)
X = fixest:::sparse_model_matrix(
y ~ x | id, df, type = c("rhs", "fixef"),
combine = TRUE, collin.rm = FALSE
)
X_dense = as.matrix(X)
#> Warning in asMethod(object): sparse->dense coercion: allocating vector of size
#> 2.3 GiB
bench::mark(
Matrix::crossprod(X, df$y),
Matrix::crossprod(X_dense, df$y),
check = FALSE
)
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:t> <dbl> <bch:byt> <dbl>
#> 1 Matrix::crossprod(X, df$y) 285µs 322µs 2620. 493KB 2.20
#> 2 Matrix::crossprod(X_dense, df$y) 488ms 499ms 2.00 89.8KB 0
Just to chime in here - there are indeed large performance advantages for the cluster jackknife CRV3 estimator as suggested by MNW (2023) if X is sparse. And if I read the MNW paper correctly, I think that many people will be interested in using the CRV3 estimator in the future =) Unfortunately, it does not play too well with (arbitrary) fixed effects, so one has to bite the bullet and create the design matrix of all dummies in most cases.
Here is a quick implementation of a part of the algorithm, which works both for sparse and dense matrices:
library(Matrix)
N <- 100000
k <- 100
y <- rnorm(N)
X <- factor(sample(1:k, N, TRUE))
df <- data.frame(X = X)
mf <- model.frame(~X, X)
X <- sparse.model.matrix(mf)
beta <- rnorm(ncol(X))
y <- X %*% beta + rnorm(N)
cluster <- sample(letters, N, TRUE)
length(unique(cluster)
# 26 clusters
y_dense <- as.matrix(y)
X_dense <- as.matrix(X)
cluster_jackknife <- function(X = X, y = y, cluster = cluster){
XXinv <- solve(crossprod(X))
Xy <- t(X) %*% y
cluster_groups <- unique(cluster)
XgXg_list <- lapply(cluster_groups, function(g) solve(crossprod(X[cluster == g,])))
Xgyg_list <- lapply(cluster_groups, function(g) t(X[cluster == g,]) %*% y[cluster == g])
# compute leave-one-out regression coefs
beta_jack <- lapply(seq_along(cluster_groups), function(g) MASS::ginv(as.matrix(XXinv - XgXg_list[[g]])) %*% (Xy -Xgyg_list[[g]]))
beta_jack
}
microbenchmark::microbenchmark(
cluster_jackknife(X_dense, y_dense, cluster),
cluster_jackknife(X, y, cluster),
times = 1
)
# cluster_jackknife(X_dense, y_dense, cluster)
# cluster_jackknife(X, y, cluster)
# min lq mean median uq max neval
# 1794.6250 1794.6250 1794.6250 1794.6250 1794.6250 1794.6250 1
# 385.1208 385.1208 385.1208 385.1208 385.1208 385.1208 1
So in this case, the sparse version is around 4-5x faster. The larger k and the dimension of clusters, the larger the speed gains will be.
Of course you could argue that I could just create the model matrix myself (at the cost of some performance overhead) - but another great advantage is that this PR creates the full model matrix of dummies - which will allow me to support all sort of fixest
syntax I currently do not know how to handle both in fwildclusterboot
and summclust
(e.g. varying slopes).
Figured out a faster way to implement this using block formula for projection matrix: https://en.wikipedia.org/wiki/Projection_matrix#Blockwise_formula
-
cov.iid
already has solved for (Xtilde' Xtilde)^{-1} where Xtilde is demeaned by fixed effects and slope vars - Calculating projection matrix of a bunch of fixed effects is really fast
- The only pain is
slope_vars
which I remove and demean by fixed effects usingfixest::demean
.
Before (using approx to speed it up):
library(data.table)
library(fixest)
library(Matrix)
library(tictoc)
#>
#> Attaching package: 'tictoc'
#> The following object is masked from 'package:data.table':
#>
#> shift
# devtools::load_all()
# Test -------------------------------------------------------------------------
N = 100000
var_fes = 10
var_eps = 4
# Ensure P_ii exists
N_i = pmax(2, rpois(N, 3))
# Create dataset
df = data.table(id = rep(1:N, N_i))
df[, j := 1:.N, by = id]
df[, fe_i := rnorm(1, 0, sd = sqrt(var_fes)), by = id]
df[, x := rnorm(.N, mean = fe_i, sd = 1)]
df[, x2 := rnorm(.N, mean = fe_i, sd = 1)]
df[, x3 := rnorm(.N, mean = fe_i, sd = 1)]
df[, x4 := rnorm(.N, mean = fe_i, sd = 1)]
df[, g := sample(1:25, .N, replace = TRUE)]
df[, y := x * 1 + fe_i + rnorm(.N, 0, sqrt(var_eps))]
df = as.data.frame(df)
# Estimate FE model
model = feols(y ~ x + x2 + x3 + x4 | id + g, df)
tic()
P_ii_orig = hatvalues(model, exact = FALSE, p = 500)
toc()
#> 12.39 sec elapsed
Created on 2023-06-17 with reprex v2.0.2
After (solving exactly):
library(data.table)
library(fixest)
library(Matrix)
library(tictoc)
#>
#> Attaching package: 'tictoc'
#> The following object is masked from 'package:data.table':
#>
#> shift
# devtools::load_all()
# Test -------------------------------------------------------------------------
N = 100000
var_fes = 10
var_eps = 4
# Ensure P_ii exists
N_i = pmax(2, rpois(N, 3))
# Create dataset
df = data.table(id = rep(1:N, N_i))
df[, j := 1:.N, by = id]
df[, fe_i := rnorm(1, 0, sd = sqrt(var_fes)), by = id]
df[, x := rnorm(.N, mean = fe_i, sd = 1)]
df[, x2 := rnorm(.N, mean = fe_i, sd = 1)]
df[, x3 := rnorm(.N, mean = fe_i, sd = 1)]
df[, x4 := rnorm(.N, mean = fe_i, sd = 1)]
df[, g := sample(1:25, .N, replace = TRUE)]
df[, y := x * 1 + fe_i + rnorm(.N, 0, sqrt(var_eps))]
df = as.data.frame(df)
# Estimate FE model
model = feols(y ~ x + x2 + x3 + x4 | id + g, df)
tic()
P_ii_orig = hatvalues(model, exact = TRUE)
toc()
#> 1.1 sec elapsed
Created on 2023-06-17 with reprex v2.0.2
Very cool PR! Nice work! I'll try to merge it next week.
@lrberge I put a couple things in this one PR. Want me rebase onto main
? I'll also adjust the work with data.table commits based on your changes
-
b992cd7 is for
did2s
printing of results with a custom VCOV. I thought you had fixed it, but it doesn't seem to work. - Sparse model matrix is super useful to allow use of fixest functions + other statistical packages
- It's been a while now, but I'm wondering if it's possible to make
sparse_model_matrix(feols(..., only.env = TRUE))
work
- It's been a while now, but I'm wondering if it's possible to make
-
hatvalues.fixest
ties directly into HC3/HC4/HU vcov andsandwich
functions that needhatvalues