GLM.jl
Allow sandwich/bootstrap/jackknife standard errors as options
This is more a feature request or policy question than a bug report.
I'm wondering whether you would like to add an argument allowing users to easily compute sandwich (heteroskedasticity-robust), bootstrap, jackknife, and possibly other types of variance-covariance matrices and standard errors, instead of the asymptotic ones. Or whether you would prefer to provide a function to compute these after the model has been fitted.
R makes it relatively cumbersome to get these common standard errors (see e.g. [1] or [2]), which means that researchers with limited programming skills are encouraged either to keep using Stata or to keep relying only on asymptotic standard errors. I see no good reason why such an option shouldn't be provided in the standard modeling package.
1: http://cran.r-project.org/doc/contrib/Fox-Companion/appendix-bootstrapping.pdf
2: http://cran.r-project.org/web/packages/sandwich/vignettes/sandwich-OOP.pdf
On this issue, I have written a basic package to calculate autocorrelation- and heteroskedasticity-robust standard errors: https://github.com/gragusa/VCOV.jl. There are methods that work for `GLM.jl` models (at the moment tested only for models without weights). I am planning to finish it in the next couple of weeks. However, it would be nice to have some infrastructure set up at the `StatsBase` level.
For instance, for models whose parameters solve \sum_{i=1}^n g(W_i, \hat{\beta}) = 0, we could think of methods (for the `StatisticalModel` type) that return \sum_{i=1}^n g(W_i, \hat{\beta}) and \sum_{i=1}^n \partial g(W_i, \hat{\beta})/\partial \beta'. This would immediately make HAC-type standard errors available for all models.
Showing the output would then only be a matter of extending the `show` method to accept an argument defining the type of variance.
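The estimating-equation machinery described above can be sketched concretely. This is a minimal, hypothetical illustration (the function name `sandwich_vcov` and its signature are my invention, not part of GLM.jl or StatsBase): given the stacked per-observation values of g(Wᵢ, β̂) and the summed Jacobian, every sandwich-type variance follows by varying only the "meat".

```julia
using LinearAlgebra

# Hypothetical sketch: G is the n×k matrix whose i-th row is g(Wᵢ, β̂)',
# and J is ∑ᵢ ∂g(Wᵢ, β̂)/∂β'.  The sandwich estimator is
# J⁻¹ (∑ᵢ gᵢ gᵢ') J⁻ᵀ; HC/HAC variants differ only in how the meat is built.
function sandwich_vcov(G::AbstractMatrix, J::AbstractMatrix)
    bread = inv(J)
    meat = G' * G          # ∑ᵢ gᵢ gᵢ' (HC0; a HAC kernel would weight cross terms)
    return bread * meat * bread'
end

# For OLS, g(Wᵢ, β) = xᵢ(yᵢ - xᵢ'β), so G = X .* û and J = X'X,
# recovering the familiar (X'X)⁻¹ X' diag(û²) X (X'X)⁻¹.
```

This is what makes the proposal attractive: a model only needs to expose its estimating functions and their Jacobian, and robust variances come for free.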
@gragusa I had missed your comment, but it sounds cool. I guess open an issue against StatsBase with suggestions about the API to discuss this.
I wrote a (registered) package, CovarianceMatrices.jl, that is already pretty exhaustive in terms of features, but I feel the API can be further improved.
I also think one should be able to specify standard errors in `fit` (rather than using a `vcov` function on the result of `fit` as in R) for several reasons:
(i) The estimates may depend on the method chosen to specify errors. In particular, if I specify clustered errors with respect to some variable, and this variable is missing for some observations, the estimate should be computed based only on those observations. This looks like a weird case, but it happens a lot with the data I'm working with.
(ii) a friendlier API for the user (one gets the relevant summary table right away).
From a simple Google search, the issue is discussed here, here, here and here, etc.
(iii) faster (since a lot of computations are generally redundant between coefficient estimation and error estimation).
(iv) potentially lighter output of regression results, since models don't need to carry every matrix needed to compute errors (this matters when working with datasets larger than a gigabyte, and obviously when working with datasets out of memory). The huge size of `lm` and `glm` models in R is discussed here, here, here, here (and for absurd consequences, here and there).
A nice solution, which relies on multiple dispatch, would be to define an `AbstractVcov` type and use it in the signature of `fit`. Any child type would then define a particular way to compute errors. For instance, one child type is `VcovCluster`. An instance of the type can be constructed with `VcovCluster(:State)`, where `State` is the variable I want to cluster on.
The errors method could then be specified as an argument to `fit`:

```julia
fit(LinearModel, formula, df, VcovCluster(:State))
fit(LinearModel, formula, df, VcovHAC(:Year, lag = 2, kernel = Bartlett))
```
I've tried to do such an implementation in my fixed effects package.
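The dispatch-based API proposed above could be sketched as follows. All of these type names are illustrative; none of them exist in GLM.jl, and the fields are only one way to carry the needed information:

```julia
# Illustrative only: a hypothetical AbstractVcov hierarchy whose concrete
# types carry the information each estimator needs, so that `fit` can
# dispatch on its last argument.
abstract type AbstractVcov end

struct VcovSimple <: AbstractVcov end   # asymptotic errors (the default)

struct VcovCluster <: AbstractVcov      # one-way clustered errors
    cluster::Symbol                     # e.g. VcovCluster(:State)
end

struct VcovHAC <: AbstractVcov          # autocorrelation-robust errors
    time::Symbol                        # time variable
    lag::Int                            # bandwidth / number of lags
    kernel::Symbol                      # e.g. :Bartlett
end

# fit(LinearModel, formula, df, VcovCluster(:State))
# fit(LinearModel, formula, df, VcovHAC(:Year, 2, :Bartlett))
```

Since the estimator is an ordinary value, it can also be stored in the fitted model, so later calls to `stderror`, `confint`, etc. consistently use the chosen variance.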
Interesting arguments.
The interface you suggest sounds good, but what about the implementation? How, when and with what arguments should the `AbstractVcov` constructor be called, so that we can handle not only clustered standard errors, but also e.g. various kinds of bootstrap? I guess for bootstrap the fitting method should pass a function which would be called again to fit the replicates. This might deserve some common infrastructure in `StatsBase`, as the procedure is essentially similar for many kinds of models.
I think in principle having the fitting method dispatch on `AbstractVcov` is a good idea, but it is a nightmare from an implementation point of view. Also, you would need `vcov` methods for the fitted model as well (for instance, you estimate a GLM with a given `Vcov` and then you want to see how the standard errors change under different assumptions on the covariance of the errors). But this would amount to lots of duplication. That's why in `CovarianceMatrices` I went the other way. I agree that for bootstrap we can have a common infrastructure. I think it would be enough to have a `summary` object that dispatches on the `RobustVariance`.
@gragusa Why do you think there should be code duplication? The `AbstractVcov` constructor could be passed the same arguments during model fitting as after (i.e. when called on the resulting object).
The problems with only choosing the standard errors type when calling `summary` have been detailed by @matthieugomez. It doesn't look like it would work very well for many cases.
Exactly, computing errors as part of `fit` does not require code duplication. I actually also define a `vcov` method in my package. The only difference with your package is that my `vcov` function (i) is not directly called by the user, and (ii) acts on a hat matrix, a matrix of regressors, and a vector of residuals, rather than on the result of a GLM model.
In fact, every child type of `AbstractVcov` has two methods: an `allvars` method (which gets all the variables needed to estimate the errors, making sure missing rows are removed before the estimation), and a `vcov` method.
The fit function is just written

```julia
function fit(formula, df, vcov_method::AbstractVcov)
    ...
    # instead of the df[complete_cases(df[allvars(formula)]), :] step in ModelFrame
    df = df[complete_cases(df[[allvars(formula); allvars(vcov_method)]]), :]
    ...
    # instead of returning the usual covariance matrix
    vcov(vcov_method, df, Xhat, X, residuals)
end
```
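To make the two-method contract concrete, here is a hedged sketch of what a `VcovCluster` child could implement. The names and the simplified `vcov` signature are illustrative (the thread's version also takes the data frame and a hat matrix); the formula is the plain CR0 cluster-robust estimator acting on raw matrices:

```julia
using LinearAlgebra

abstract type AbstractVcov end
struct VcovCluster <: AbstractVcov
    cluster::Symbol
end

# allvars: the extra variables the estimator needs, so that rows with
# missing values on them can be dropped before fitting.
allvars(v::VcovCluster) = [v.cluster]

# vcov: the CR0 cluster-robust estimator, computed from raw matrices
# (cluster assignments, regressors, residuals) rather than a model object.
function vcov(v::VcovCluster, clusters::AbstractVector,
              X::AbstractMatrix, residuals::AbstractVector)
    bread = inv(X' * X)
    k = size(X, 2)
    meat = zeros(k, k)
    for g in unique(clusters)
        idx = findall(==(g), clusters)
        s = X[idx, :]' * residuals[idx]   # cluster-level score
        meat .+= s * s'                   # sum of outer products
    end
    return bread * meat * bread'
end
```

A sanity check on the design: when every observation is its own cluster, this collapses to the usual HC0 sandwich.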
This is actually similar to how Stata works: the `avar` package in Stata acts on matrices rather than models and is used under the hood by very different fit commands: https://ideas.repec.org/c/boc/bocode/s457689.html
The problem is functional redundancy (for lack of a better word). In Stata you have to refit the model to get new standard errors. In Julia you could do
```julia
reg_1 = fit(::Formula, df, ::AbstractVcov)
```

and then have

```julia
vcov(reg_1, ::AbstractVcov)
```

to get new standard errors. So you have two different interfaces to do the same thing.
I don't see why that would be an issue. It works the same in R. Also, these are not really the same, since `vcov` only gives you the matrix, while passing the argument to `fit` would mean all actions on the model would use the chosen standard errors: summary, confidence intervals, plots...
Thinking more about it, I do not agree with including variance information in the `fit` method. The reason is that logically such variance info should be relevant to the fit, but here it is only relevant to the summary.
If the variance of the error vector is not a multiple of the identity matrix, then the efficient estimator is different. So I would expect a different fitting method.
Conceptually, I would expect

```julia
glm_1 = fit(::GLM, df, ::WorkingVar)
```

and `WorkingVar` would be, for instance, `Exchangeable` or `AR`. In that case I would expect fitting by estimating equations. Then, even in this case, I could ask what the robust variance is:

```julia
summary(glm_1, ::AbstractVcov)
```
What I am trying to say is that a variance in `fit` is logically a different object than a variance in `summary`.
@gragusa Sorry, I'm not familiar with the methods you're talking about. But it seems to me that in many cases several variance estimation methods can be used with a given fitting method. The simplest example is OLS regression, which can be combined with asymptotic (the default), heteroskedasticity-robust/sandwich, jackknife, bootstrap, and several kinds of survey design-based methods.
Anyway, we're not saying you shouldn't be allowed to pass another method to `summary` if needed: only that you should also be able to pass it to `fit`. For models where it might not make sense, raising an error is fine.
Finally, I don't think you addressed @matthieugomez's point that for clustered standard errors, one needs to get rid of missing values on the clustering variable before the fit.
I had a `vcov(obj::MyPkg.Struct, vce::AbstractVCE)` method in my package, but eventually moved to specifying it at the moment of building the model, in part because the cluster variables are a defining feature. In order to request a different vcov one needs to refit the model, but as a benefit the implementation is a lot smoother. Basically, packages only have to implement:
```julia
StatsBase.modelmatrix(obj::StatsBase.RegressionModel) =
    error("modelmatrix is not defined for $(typeof(obj)).")
StatsBase.residuals(obj::StatsBase.RegressionModel) =
    error("residuals is not defined for $(typeof(obj)).")
StatsBase.dof(obj::RegressionModel) =
    error("dof is not defined for $(typeof(obj)).")
StatsBase.dof_residual(obj::RegressionModel) =
    error("dof_residual is not defined for $(typeof(obj)).")
```
In addition, it has to provide the following methods:

```julia
getvce(obj::StatsBase.RegressionModel) = :OLS

function getclusters(obj::StatsBase.RegressionModel)
    û = StatsBase.residuals(obj)
    output = Matrix{Int64}(undef, length(û), 1)
    output[:, 1] = eachindex(û)
    return output
end

function getΓ(obj::StatsBase.RegressionModel)
    n = size(StatsBase.modelmatrix(obj), 2)
    return zeros(n, n)
end
```
A variance-covariance package should be able to compute any variance-covariance estimator for linear models with that information. If packages save in their structs the inverse of the normal matrix or the pseudo-inverse (the "bread"), these could be accessed to avoid computing the inverse twice. The API would just be two functions:
```julia
function vcov!(obj::StatsBase.RegressionModel, fieldname::Symbol)
    setfield!(obj, fieldname, vce(obj))
end

function vce(obj::StatsBase.RegressionModel)
    Bread = getbread(obj)
    ũ = getũ(obj)
    G = getG(obj)
    Meat = ũ * ũ' .* G
    fsa = getγ(obj)    # finite-sample adjustment
    return fsa * Bread * Meat * Bread'
end
```
@Nosferican The best way to show that this design suits people's needs is to put it in a package and write implementations using it for the various packages that can take advantage of it. (BTW, in Julia we usually omit the `get` prefix from function names.)
Here is a prototype: VarianceCovarianceEstimators.jl. It provides OLS and the multi-way clustering generalized estimators for HC1, HC2, and HC3. It relies on the `StatsBase.RegressionModel` methods.
Just a quote from a user that supports specifying standard errors in `fit`:

> "I don’t know if GLM.jl supports somehow directly providing the adjusted SEs, in which case this request would be obsolete. So far run regressions with GLM.fit() and adjust the SEs afterwards using CovarianceMatrices.jl. This is a bit tiresome, if you have many regressions."

https://discourse.julialang.org/t/ann-regressiontables-jl-produces-publication-quality-regression-tables/7516/19
I think between @matthieugomez, @lbittarello and me, we could probably commit to developing a solution along the following lines:

- Expand the `DataFrames` + `StatsBase` + `StatsModels` + `GLM` pipeline for a more robust solution for regression models
From these conversations, I think the following solution should work pretty well, but it would be nice to get some feedback:
- Utility package for various transformations and helpers:
  - generalized within transformation (absorbs fixed effects and handles singletons)
  - first-difference transformation
  - between estimator
  - two-stage estimators
  - subset linearly independent predictors
  - etc.
- Intermediate package for computing the distances (multi-way clustering, spatial, and temporal) and the kernel auto-tuners for correlation structures, which will then be used to provide sandwich estimators (multi-way clustering; HAC: temporal, spatial, spatial-and-temporal; HC; etc.) with the necessary components
- The covariance matrices package for sandwich estimators and bootstrapping
- Regression hypothesis tests and diagnostics: StatsBase will host the Wald, LR, and score tests. Hypothesis testing for various procedures will construct the corresponding hypothesis test (Wald test, robust Hausman, etc.)
Does that sound like a good plan?
@dmbates, would that work for MixedModels? For example, using the new StatsModels.Formula and passing (|) syntax to a dedicated parser.
Let me know what y'all think. I would say most of the features are already implemented across the board, but they need to be organized.
Something I started experimenting with was to borrow the GLM framework for fitting the model and use it as a wrapper for the model matrix, model response, weights, etc. That way, the package would have access to GLM estimators, it would be easier to port CovarianceEstimators, etc.
Can we keep this discussion focused on the covariance issue? There was a previous discussion about tests at https://github.com/JuliaStats/HypothesisTests.jl/issues/121.
@Nosferican Can you describe the changes that would be needed in StatsBase/StatsModels to suit your needs? Then others could comment on whether it works for them. Apart from the functions you described above, maybe we'd need `LinearRegressionModel <: RegressionModel`?
See Microeconometrics. It requires models to define `score` and `jacobian`.
CovarianceMatrices already works for `DataFramesRegressionModel`, which exploits the various weights (i.e., `wrkwts`) for computing the covariance estimates for GLM.
The alternative I am exploring right now is having the Regression package use `GeneralizedLinearModel` to compute the `wrkwts`, identify the model type by `d` and `l`, etc. If this approach works well, then models could implement the accessors to the model `<:RegressionModel` directly. Having `GeneralizedLinearModel <: RegressionModel` is not a complete solution, since the variance-covariance estimators will still need to access more elements than just those in `GeneralizedLinearModel`.
Estimators that can be computed as transformations of the linear predictor and the response can still use the framework by first computing the transformations and then passing these to the `GeneralizedLinearModel` constructor. Instrumental estimators can extract the `coef` from the model and use it with the appropriate model matrix to get the correct residuals. Currently there are some estimators, such as ridge regression, which do not fit in the `fit!(::GeneralizedLinearModel)` framework. It would be helpful to expand the framework to handle them. I don't know if GLM currently supports all kinds of `weights`, but Microeconometrics.jl has put some work into handling those, which could be ported to GLM.
As for support in other packages, there will probably be a CorrelationStructure package to handle computing distances for special dimensions (temporal: period, `Date`, `DateTime`; spatial: coordinate system, datum, distance metric). Another package will probably provide and select the kernels based on the distance and the data (unless one is specified). The covariance matrices package will lastly take all those inputs and return the estimates, which can be incorporated into the model; packages (including GLM) will then be able to store or display them.
In summary, if we go with `score` and `jacobian`, then those should be established in StatsBase. If we decide to provide it based on a GLM framework, the work would be in GLM to get the weights working and make the fitting process more flexible so packages can use it more freely. Support for linear models can already be achieved using the current hierarchy inheriting from `RegressionModel` and using the prototype.
OK. I guess it's up to you, @dmbates, @matthieugomez and @lbittarello whether you want to base your packages on GLM or on something more generic living in StatsBase/StatsModels.
One advantage is that we can put more effort into making GLM as good as it can be. For example, checking all the possible failure cases for logistic regression can be handled either in the utilities package or even in GLM (e.g., handling linearly dependent variables). Either way, the dependency on GLM is soft. A package wouldn't actually need to use GLM (just behave like GLM). Specifying `wrkresidwts(::RegressionModel)` would work regardless of whether GLM was ever used or not. A hybrid could work just as well, for example computing the `score` and `jacobian` based on the GLM values. Suggestions are welcome.
That is sort of what I do in `MixedModels`, where a generalized linear mixed model (GLMM) is implemented in that package but using the `GLM.GlmResp` type. If the truth be told, being able to do this is the reason that `GlmResp` is defined separately from `GeneralizedLinearModel`.
Here's the rationale for `score` and `jacobian`:
The asymptotic variance of GMM estimators takes the form E (ψ Ω ψ'), where ψ is the influence function and Ω is the correlation structure. Moreover, ψ = (E J)⁻¹ s, where J is the Jacobian of the score vector (a.k.a. the Hessian matrix) and s is the score vector (i.e., the moment conditions). If we have the score and the Jacobian, we can compute all sorts of variances (White, clustered, etc.) by varying Ω.
GMM subsumes almost all parametric estimators in econometrics (as well as many semiparametric ones). (I'm not familiar with biostatistics and other fields.) For example, MLE and GLM are special cases. We can also interpret two-stage estimators (e.g., IPW) as special cases if both stages are parametric.
Therefore, basing the variance on `score` and `jacobian` gives us a lot of flexibility. We'd be able to cover models which don't fit into the GLM framework. And it should be easy to adapt GLM: CovarianceMatrices already computes the score (in its `meat` function) and the Jacobian (in its `bread` function) of `LinPredModel` with minimal effort.
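As a hedged illustration of the score/Jacobian recipe for a GLM (this is not CovarianceMatrices' actual code, and `logistic_white_vcov` is an invented name): for logistic regression the i-th score is xᵢ(yᵢ − μᵢ) and the Jacobian of the summed score is X′WX with W = diag(μᵢ(1 − μᵢ)), so the White variance is one particular choice of Ω.

```julia
using LinearAlgebra

# Sketch: with the score matrix S (row i = xᵢ(yᵢ − μᵢ)') and the Jacobian
# J = X'WX, the White/HC0 variance is J⁻¹ (S'S) J⁻ᵀ.  Clustered, HAC, etc.
# replace only S'S by a different Ω estimate, exactly as the comment says.
function logistic_white_vcov(X::AbstractMatrix, y::AbstractVector,
                             β::AbstractVector)
    μ = 1 ./ (1 .+ exp.(-X * β))             # fitted probabilities
    S = X .* (y .- μ)                         # n×k score matrix
    J = X' * Diagonal(μ .* (1 .- μ)) * X      # Jacobian ("bread" inverse)
    bread = inv(J)
    return bread * (S' * S) * bread'
end
```

Note that nothing here is specific to the logistic link: any model that can hand over its score matrix and Jacobian gets the same treatment.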
Is there any update on this?
Which extension in particular are you interested in? I am in the process of extending the API and suggestions are very useful. What is your use case?
I don't need anything fancy, but to make my workflow work in Julia I need the following:

1. Run a regression.

Then either:

2a. Specify already during the regression what kind of standard errors I need (for example HAC- or HC-robust SEs), or
2b. Be able to specify ex post the standard errors I need, and save them either to the object that is directly exported by GLM or to another vector.

Most importantly, then:

3. Be able to automatically export a regression table to LaTeX with e.g. the HAC-robust standard errors/p-values/stars.

For now I do 1 -> 2b -> 3 in R.
For vcov, see the new additions to StatsBase (`information_matrix` and `score`). GLM is pretty basic in terms of models, so it would be best to just provide the accessors and have the general vcov dependency. For exporting to LaTeX, I suspect you want the `CoefTable`. That could probably be implemented in StatsBase.
See also https://github.com/jmboehm/RegressionTables.jl.
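For the LaTeX-export step, here is a minimal self-contained sketch (this is not RegressionTables.jl's implementation, and `latex_rows` is an invented helper) of turning coefficient names, estimates, and user-supplied robust standard errors into tabular rows:

```julia
# Hypothetical helper: render coefficient names, estimates, and
# user-supplied (e.g. HAC-robust) standard errors as LaTeX table rows,
# with the standard error in parentheses under the usual convention.
function latex_rows(names::Vector{String}, b::Vector{Float64},
                    se::Vector{Float64})
    rows = String[]
    for (n, bi, si) in zip(names, b, se)
        push!(rows, string(n, " & ", round(bi, digits = 3),
                           " & (", round(si, digits = 3), ") \\\\"))
    end
    return join(rows, "\n")
end
```

The point is only that once custom standard errors are a first-class value, any table writer can consume them without knowing how they were computed.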
@nalimilan Essentially there needs to be a way to combine the outputs of `GLM.jl`, `CovarianceMatrices.jl` and `RegressionTables.jl`. I don't really know if the solution is to either

- integrate the functionality of `CovarianceMatrices.jl` in `GLM.jl`, or
- allow `CovarianceMatrices.jl` to manipulate the objects that `GLM.jl` produces, or
- allow `RegressionTables.jl` to display custom standard errors, p-values etc.
I hope I made myself sufficiently clear. For illustration I post the R code that I am using:
```r
# Return regression
regression_model <- lm(Y ~ X1 + X2, data = data)
regression_model_2 <- lm(Y ~ X1 + X3, data = data)

notes = c("\\parbox[t]{7cm}{I am using custom standard errors}")
se_new = sqrt(diag(vcovHC(regression_model)))
se_new_2 = sqrt(diag(vcovHC(regression_model_2)))

stargazer(regression_model, regression_model_2,
          out = paste("RegResults/Returns_growth", spec, ".tex", sep = ""), title = "Regression",
          se = list(se_new, se_new_2), style = "qje",
          covariate.labels = c("$\\Delta z_{0.75,t}$", "$\\Delta \\sigma^{2}_{z,t}$"),
          dep.var.caption = "", dep.var.labels = "Returns", float = FALSE,
          df = FALSE, notes.append = FALSE, omit.stat = c("F", "ser"),
          notes = notes, notes.label = "", notes.align = "l")
```
I don't think that this is possible at the moment using the above-mentioned packages, but this is a very standard setup for me.
The Julia way is to avoid type piracy. StatsBase provides `RegressionModel <: StatisticalModel` as well as `coeftable` and `CoefTable`. For `vcov`, CovarianceMatrices should use the StatsBase API, such that any `RegressionModel` that implements the API can make use of it. The user interacts with whatever package, and the package calls/depends on CovarianceMatrices internally. `RegressionTables` can then implement the methods for `StatsBase.CoefTable`. Any changes needed to `CoefTable` for implementing those methods should be addressed as an issue in StatsBase.
The issue with the R implementation is that many times `vcovHC`, for instance, needs to be aware of things `glm` does not keep track of in order to work properly (e.g., with fixed effects it would use the wrong `dof_residual`). For CRVE estimates one needs access to the cluster dimensions, which are not available when recomputing the vce on the fly with a different HC. Similarly, for HAC it is important to keep track of steps and gaps, which `glm` doesn't understand. This leads to R's way needing some manual input, which can be annoying and error-prone.