Frequency weights vs Sampling Weights
Hi,
When estimating weighted least squares in Stata, one has the option to differentiate between frequency and sampling weights. In R, lm() and fixest's feols() by default seem to assume that weights are probability weights. Maybe it would make sense to add an additional function argument to feols() to allow for frequency weights?
I believe that it should be easy to implement - basically, only the small sample adjustment factors in feols() would need to be adjusted.
This is a good discussion of the different weight types in Stata: https://www.parisschoolofeconomics.eu/docs/dupraz-yannick/using-weights-in-stata(1).pdf.
Here is a brief example to illustrate this. In short, I create a "long" and an "aggregated" data set and estimate the same model via OLS on the "long" data and via WLS on the "aggregated" data. The point estimates are identical, but the inference differs because of different small sample adjustments - for WLS, the default adjustments in feols() and lm() correspond to a "probability weights" interpretation. When the data is aggregated, the number of rows in the data is M < N (the number of underlying observations), and a small sample adjustment of (M-1) / (M-k) is applied, which aligns with "probability weights". For "frequency weights", one would instead have to use a small sample correction of (N-1) / (N-k).
library(fixest)
library(data.table)
# simulate a simple data set with many duplicated (Y, X) rows
X <- sample(1:10, 100, TRUE)
e <- sample(c(-1, 1), 100, TRUE)
Y <- 2 * X + e
dt <- data.table(Y = Y, X = X)
# create aggregate data & count duplicate observations
dt_agg <- dt[, .N, by = c("Y", "X")]
head(dt_agg)
#     Y X  N
# 1:  7 3 10
# 2:  7 4  7
# 3: 17 9  7
# 4:  5 3  4
# 5: 19 9  5
# 6: 13 6  5
# estimate OLS on the "long" data set & WLS on the "aggregated" data set
fit_ols <- feols(Y ~ X, data = dt)
fit_wls <- feols(Y ~ X, weights = ~N, data = dt_agg)
etable(fit_ols, fit_wls)
#                           fit_ols             fit_wls
# Dependent Var.:                 Y                   Y
#
# (Intercept)      -0.0969 (0.2043)    -0.0969 (0.4767)
# X               2.003*** (0.0340)   2.003*** (0.0793)
So OLS and WLS produce the same point estimates, but different inferential results.
To adjust the WLS inference towards a frequency-weights interpretation:
N <- nrow(dt)               # number of underlying observations
M <- nrow(dt_agg)           # number of rows in the aggregated data
k <- length(coef(fit_ols))  # number of estimated coefficients
# t statistics from the long-data OLS fit, for comparison
coeftable(fit_ols)[, "t value"]
# rescale the WLS t statistics by the ratio of the two small sample corrections
coeftable(fit_wls)[, "t value"] / ((M - 1) / (M - k)) * (N - 1) / (N - k)
# > coeftable(fit_wls)[, "t value"] * (M - k) / (M-1) * (N-1) / (N-k)
# (Intercept) X
# -0.1946012 24.1760943
# > (M - k) / (M-1) * (N-1) / (N-k)
# [1] 0.9570354
# > coeftable(fit_wls)[, "t value"] / ((M - 1) / (M-k) )* (N-1) / (N-k)
# (Intercept) X
# -0.1946012 24.1760943
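The same rescaling can also be written at the level of the variance matrix. Below is a minimal sketch of that idea, reusing M, N, k and fit_wls from above (ssc_prob, ssc_freq and vcov_freq are just illustrative names, not fixest functionality):
# rescale the default (probability-weight style) variance matrix by the
# ratio of the two small sample corrections discussed above
ssc_prob <- (M - 1) / (M - k)  # correction based on the M aggregated rows
ssc_freq <- (N - 1) / (N - k)  # correction based on the N underlying observations
vcov_freq <- vcov(fit_wls) * ssc_freq / ssc_prob
sqrt(diag(vcov_freq))          # rescaled standard errors
A proper frequency-weights option would of course have to apply this inside feols() itself rather than after the fact.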
Thanks for raising this issue, Alexander @s3alfisc, and for pointing me to this discussion.
First, I agree that this is an important issue and a source of confusion for many users of lm(). Also, other model classes in R do interpret their weights argument as frequency weights (aka case weights), while lm() and glm() don't.
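For instance, lm() shows the same behaviour on the data from the example above. A quick check, reusing dt and dt_agg (fit_lm_long and fit_lm_agg are just illustrative names):
# lm() also treats the weights as precision/probability weights: the point
# estimates match the long-data fit, while the residual degrees of freedom
# are based on the M aggregated rows rather than the N underlying observations
fit_lm_long <- lm(Y ~ X, data = dt)
fit_lm_agg  <- lm(Y ~ X, weights = N, data = dt_agg)
summary(fit_lm_long)$coefficients
summary(fit_lm_agg)$coefficients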
Second, I'm not sure, though, that the best way to handle this is "ex post" through extractor functions. I think it is relevant in several places and ideally should be specified when fitting the model and then stored in the model object. Then the subsequent extractor functions can deal with it appropriately.
Thus, fixest could introduce an argument that specifies how the weights should be interpreted. I haven't tried to check how difficult this would be. My gut feeling is that the devil is in the details and that this would matter in many places.
R core member Thomas Lumley has written a very nice blog post, "Weights in statistics", that distinguishes not only these two types of weights (precision weights vs. frequency weights) but also a third type (sampling weights as used in survey models). I encouraged Thomas to improve the documentation of lm() and glm() in this direction, but unfortunately he has not followed up on this so far.
Finally, note that the discussion above sometimes confuses precision weights (as in WLS estimation) with sampling weights (as in survey models). The jargon is really tricky here because it is not unified and sometimes ambiguous.
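To illustrate the third type, here is a sketch of how sampling weights are typically handled, assuming the survey package is installed and reusing dt_agg from above (des and fit_svy are just illustrative names):
# sampling weights enter through a survey design object rather than a plain
# weights argument; the standard errors are design-based (linearization)
library(survey)
des <- svydesign(ids = ~1, weights = ~N, data = as.data.frame(dt_agg))
fit_svy <- svyglm(Y ~ X, design = des)
summary(fit_svy)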
Hi everyone!
This frequency weights thing has been in the drawer for a while now. As @zeileis mentions: the devil is in the details.
Every time I think about implementing it, here's my line of thought: that would be super simple to implement, let's go! ----> oh, but then I need to change that method... and this method too... and another one... and... ----> OK, that's not so easy in fact, let's postpone.
I agree with @zeileis that it should be within the estimating function, not in the extractor.
Regarding the implementation in fixest, the main issue is that it has consequences in many places, so it's pretty tedious to implement properly (not difficult, but tedious, so really not fun).
However, that's an important feature and I'll implement it at some point (that's in my priorities for this package).
Makes a lot of sense to me. Thanks, Laurent @lrberge !