GLM.jl

Allow sandwich/bootstrap/jackknife standard errors as options

Open nalimilan opened this issue 11 years ago • 35 comments

This is more a feature request or policy question than a bug report.

I'm wondering whether you would like to add an argument making it easy to compute sandwich (heteroskedasticity-robust), bootstrap, jackknife, and possibly other types of variance-covariance matrices and standard errors, instead of the asymptotic ones. Or whether you would like to provide a function to compute these after the model has been fitted.

R makes it relatively cumbersome to get these common standard errors (see e.g. [1] or [2]), which means that researchers with limited programming skills either keep using Stata or keep relying only on asymptotic standard errors. I see no good reason why such an option shouldn't be provided in the standard modeling package.

1: http://cran.r-project.org/doc/contrib/Fox-Companion/appendix-bootstrapping.pdf
2: http://cran.r-project.org/web/packages/sandwich/vignettes/sandwich-OOP.pdf
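For concreteness, here is a minimal sketch of the kind of estimator the request is about: White's heteroskedasticity-robust (HC0) sandwich standard errors for OLS, computed by hand. This is not GLM.jl code, and `hc0_stderror` is a hypothetical helper name.

```julia
# Minimal sketch of White's HC0 sandwich standard errors for OLS.
# `hc0_stderror` is a hypothetical helper, not part of GLM.jl.
using LinearAlgebra

function hc0_stderror(X::AbstractMatrix, y::AbstractVector)
    β = X \ y                              # OLS coefficients
    u = y - X * β                          # residuals
    bread = inv(X' * X)                    # (X'X)⁻¹
    meat = X' * Diagonal(u .^ 2) * X       # X' diag(uᵢ²) X
    V = bread * meat * bread               # sandwich variance estimate
    return β, sqrt.(diag(V))
end

X = [ones(6) collect(1.0:6.0)]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0]
β, se = hc0_stderror(X, y)
```

The point of the issue is that this computation (and its bootstrap/jackknife analogues) should be available as an option rather than requiring the user to assemble it manually.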

nalimilan avatar Dec 16 '13 16:12 nalimilan

On this issue, I have written a basic package to calculate autocorrelation- and heteroskedasticity-robust standard errors: https://github.com/gragusa/VCOV.jl.

There are methods that work for GLM.jl models (at the moment tested only for models without weights).

I am planning to finish it in the next couple of weeks. However, it would be nice to have some infrastructure set up at the StatsBase level.

For instance, for models whose parameters solve \sum_{i=1}^n g(W_i, \hat{\beta}) = 0, we could think of methods (for the StatisticalModel type) that return \sum_{i=1}^n g(W_i, \hat{\beta}) and \sum_{i=1}^n \partial g(W_i, \hat{\beta})/\partial \beta'. This would immediately make HAC-type standard errors available for all models.

Showing the output would then only be a matter of extending the show method to accept an argument defining the type of variance.

gragusa avatar Dec 02 '14 16:12 gragusa

@gragusa I had missed your comment, but it sounds cool. I guess open an issue against StatsBase with suggestions about the API to discuss this.

nalimilan avatar Mar 06 '15 14:03 nalimilan

I wrote a (registered) package, CovarianceMatrices.jl, that is already pretty exhaustive in terms of features, but I feel the API can be further improved.

gragusa avatar Mar 06 '15 14:03 gragusa

I also think one should be able to specify standard errors in fit (rather than using a vcov function on the result of fit, as in R) for several reasons:

(i) The estimates may depend on the method chosen for the errors. In particular, if I specify errors clustered on some variable, and this variable is missing for some observations, the estimate should be computed based only on the remaining observations. This looks like a corner case, but it happens a lot with the data I'm working with.

(ii) It makes a friendlier API for the user (one gets the relevant summary table right away). From a simple Google search, the issue is discussed here, here, here and here, etc.

(iii) It is faster, since a lot of computations are generally redundant between the estimates and the error estimation.

(iv) It gives a potentially lighter output of regression results, since models don't need to carry every matrix required to compute the errors (it matters when working with datasets larger than a gigabyte, and obviously when working with datasets out of memory). The huge size of lm and glm models in R is discussed here, here, here and here (and, for absurd consequences, here and there).

A nice solution, which relies on multiple dispatch, would be to define an AbstractVcov type and use it in the signature of fit. Any child type would then define a particular way to compute the errors. For instance, one child type is VcovCluster. An instance of the type can be constructed with VcovCluster(:State), where State is the variable I want to cluster on.
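One possible sketch of this hierarchy (all names are illustrative, not an actual GLM.jl API):

```julia
# Illustrative sketch of the proposed AbstractVcov hierarchy.
abstract type AbstractVcov end

struct VcovSimple <: AbstractVcov end    # default asymptotic errors

struct VcovCluster <: AbstractVcov
    cluster::Symbol                      # variable to cluster on, e.g. :State
end

# Variables a vcov method needs, so that rows with missing values
# on them can be dropped before fitting
allvars(::VcovSimple) = Symbol[]
allvars(v::VcovCluster) = [v.cluster]
```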

The error method could then be specified as an argument to fit:

fit(LinearModel, formula, df, VcovCluster(:State))
fit(LinearModel, formula, df, VcovHAC(:Year, lag = 2, kernel = Bartlett))

I've tried such an implementation in my fixed effects package.

matthieugomez avatar Jun 24 '15 23:06 matthieugomez

Interesting arguments.

The interface you suggest sounds good, but what about the implementation? How, when and with what arguments should the AbstractVcov constructor be called, so that we can handle not only clustered standard errors, but also e.g. various kinds of bootstrap? I guess for bootstrap the fitting method should pass a function which would be called again to fit the replicates. This might deserve some common infrastructure in StatsBase, as the procedure is essentially similar for many kinds of models.

nalimilan avatar Jun 25 '15 12:06 nalimilan

I think in principle having the fitting method dispatch on AbstractVcov is a good idea, but it is a nightmare from an implementation point of view. Also, you would still need Vcov methods for the fitted model (for instance, you estimate a GLM with one Vcov and then want to see how the standard errors change under different assumptions on the covariance of the errors). But this would amount to having lots of duplication.

That's why in CovarianceMatrices I went the other way. I agree that for bootstrap we can have a common infrastructure.

I think it would be enough to have a summary object that dispatches on the RobustVariance.


gragusa avatar Jun 25 '15 12:06 gragusa

@gragusa Why do you think there should be code duplication? The AbstractVcov constructor could be passed the same arguments during model fitting as after (i.e. when called on the resulting object).

The problems with only choosing the standard errors type when calling summary have been detailed by @matthieugomez. It doesn't look like it would work very well for many cases.

nalimilan avatar Jun 25 '15 12:06 nalimilan

Exactly, computing errors as part of fit does not require code duplication. I actually also define a vcov method in my package. The only difference with your package is that my vcov function (i) is not directly called by the user and (ii) acts on a hat matrix, a matrix of regressors, and a vector of residuals, rather than on the result of a glm model.

In fact, every child type of AbstractVcov has two methods: an allvars method (which gets all the variables needed to estimate the errors, ensuring that rows with missing values are removed before the estimation), and a vcov method.

The fit function is then just written:

function fit(formula, df, vcov_method::AbstractVcov)
    ...
    # instead of the df[complete_cases(df[allvars(formula)]), :] step in ModelFrame
    df = df[complete_cases(df[[allvars(formula); allvars(vcov_method)]]), :]
    ...
    # instead of returning the usual covariance matrix
    vcov(vcov_method, df, Xhat, X, residuals)
end

This is actually similar to how Stata works: the avar package in Stata acts on matrices rather than on models and is used under the hood by many different estimation commands: https://ideas.repec.org/c/boc/bocode/s457689.html

matthieugomez avatar Jun 25 '15 13:06 matthieugomez

The problem is functional redundancy (for lack of a better word). In Stata you have to refit the model to get new standard errors. In Julia you could write

reg_1 = fit(formula, df, ::AbstractVcov)

and then have

vcov(reg_1, ::AbstractVcov)

to get new standard errors. So you have two different interfaces to do the same thing.

gragusa avatar Jun 25 '15 13:06 gragusa

I don't see why that would be an issue. It works the same in R. Also, these are not really the same, since vcov only gives you the matrix, while passing the argument to fit would mean all actions on the model would use the chosen standard errors: summary, confidence intervals, plots...

nalimilan avatar Jun 25 '15 13:06 nalimilan

Thinking more about it, I do not agree with including variance information in the fit method. The reason is that, logically, variance info passed to fit should be relevant to the fitting, but here it is only relevant to the summary.

If the variance of the error vector is not a multiple of the identity matrix, then the efficient estimator is different. So I would expect a different fitting method.

Conceptually, I would expect

glm_1 = fit(::GLM, df, ::WorkingVar) 

and WorkingVar would be, for instance, Exchangeable or AR. In this case I would expect fitting by estimating equations. Then, even in this case, I could ask what the robust variance is:

summary(glm_1, ::AbstractVcov)

What I am trying to say is that a variance in fit is logically a different object than a variance in summary.

gragusa avatar Jun 25 '15 14:06 gragusa

@gragusa Sorry, I'm not familiar with the methods you're talking about. But it seems to me that in many cases several variance estimation methods can be used with a given fitting method. The simplest example is OLS regression, which can be combined with asymptotic (the default), heteroskedasticity-robust/sandwich, jackknife, bootstrap, and several kinds of survey design-based methods.

Anyway, we're not saying you shouldn't be allowed to pass another method to summary if needed: only that you should also be able to pass it to fit. For models where it might not make sense, raising an error is fine.

Finally, I don't think you addressed @matthieugomez's point that for clustered standard errors, one needs to get rid of missing values on the clustering variable before the fit.

nalimilan avatar Jun 26 '15 07:06 nalimilan

I had a vcov(obj::MyPkg.Struct, vce::AbstractVCE) method in my package, but eventually moved to specifying it when building the model, in part because the cluster variables are a defining feature. In order to request a different vcov one needs to refit the model, but as a benefit the implementation is a lot smoother. Basically, packages only have to implement:

StatsBase.modelmatrix(obj::StatsBase.RegressionModel) =
    error("modelmatrix is not defined for $(typeof(obj)).")
StatsBase.residuals(obj::StatsBase.RegressionModel) =
    error("residuals is not defined for $(typeof(obj)).")
StatsBase.dof(obj::StatsBase.RegressionModel) =
    error("dof is not defined for $(typeof(obj)).")
StatsBase.dof_residual(obj::StatsBase.RegressionModel) =
    error("dof_residual is not defined for $(typeof(obj)).")

In addition, it has to provide the following methods:

getvce(obj::StatsBase.RegressionModel) = :OLS
function getclusters(obj::StatsBase.RegressionModel)
    û = StatsBase.residuals(obj)
    output = Matrix{Int64}(undef, length(û), 1)
    output[:, 1] .= eachindex(û)
    return output
end
function getΓ(obj::StatsBase.RegressionModel)
    n = size(StatsBase.modelmatrix(obj), 2)
    return zeros(n, n)
end
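As an illustration, a minimal model implementing those accessors might look as follows. `ToyOLS` is a made-up type; only the StatsBase API above is assumed.

```julia
# Hypothetical minimal RegressionModel implementing the StatsBase accessors.
using StatsBase, LinearAlgebra

struct ToyOLS <: StatsBase.RegressionModel
    X::Matrix{Float64}
    y::Vector{Float64}
    β::Vector{Float64}
end
ToyOLS(X, y) = ToyOLS(X, y, X \ y)   # fit on construction

StatsBase.modelmatrix(m::ToyOLS) = m.X
StatsBase.residuals(m::ToyOLS) = m.y - m.X * m.β
StatsBase.dof(m::ToyOLS) = size(m.X, 2)
StatsBase.dof_residual(m::ToyOLS) = size(m.X, 1) - size(m.X, 2)
```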

A variance-covariance package should be able to compute any variance-covariance estimator for linear models with that information. If packages save in their structs the inverse of the normal matrix or the pseudo-inverse (the bread), these could be accessed to avoid computing the inverse twice. The API would just be two functions:

function vcov!(obj::StatsBase.RegressionModel, fieldname::Symbol)
	setfield!(obj, fieldname, vcov(obj, vce(obj)))
end
function vce(obj::StatsBase.RegressionModel)
	Bread = getbread(obj)
	ũ = getũ(obj)
	G = getG(obj)
	Meat = (ũ * ũ') .* G
	fsa = getγ(obj)
	return fsa * Bread * Meat * Bread'
end

Nosferican avatar Nov 14 '17 02:11 Nosferican

@Nosferican The best way to show that this design suits people's needs is to put them in a package and write implementations using it for the various packages that can take advantage of it. (BTW, in Julia we usually omit the get prefix to functions.)

nalimilan avatar Nov 14 '17 07:11 nalimilan

Here is a prototype VarianceCovarianceEstimators.jl. It provides OLS, and the multi-way clustering generalized estimators for HC1, HC2, and HC3. It relies on the StatsBase.RegressionModel methods.

Nosferican avatar Nov 27 '17 23:11 Nosferican

Just a quote from a user that supports specifying standard errors in fit:

"I don’t know if GLM.jl supports somehow directly providing the adjusted SEs, in which case this request would be obsolete. So far run regressions with GLM.fit() and adjust the SEs afterwards using CovarianceMatrices.jl. This is a bit tiresome, if you have many regressions."

https://discourse.julialang.org/t/ann-regressiontables-jl-produces-publication-quality-regression-tables/7516/19

matthieugomez avatar Dec 12 '17 16:12 matthieugomez

I think that between @matthieugomez, @lbittarello and me, we could probably commit to developing a solution along these lines:

  • Expand the DataFrames + StatsBase + StatsModels + GLM pipeline into a more robust solution for regression models

  • From our conversations I think the following approach should work pretty well, but it would be nice to get some feedback

    • Utility package for various transformations and helpers:
      • generalized within transformation (absorbs fixed effects and handles singletons)
      • first-difference transformation
      • between estimator
      • two stage estimators
      • subset linear independent predictors
      • etc.
    • Intermediate package for computing the distance (multi-way clustering, spatial, and temporal) and kernel auto-tuners for correlation structures which will then be used to provide Sandwich estimators (multi-way clustering, HAC (temporal, spatial, spatial and temporal), HC, etc.) with the necessary components
    • The covariance matrices package for sandwich estimators and bootstrapping
    • Regression hypothesis tests and diagnostics: StatsBase will host the Wald, LR, and score tests. Hypothesis testing for various cases will construct the corresponding test (Wald test, robust Hausman, etc.)

Does that sound like a good plan?

@dmbates, would that work for MixedModels? For example, using the new StatsModels.Formula and passing (|) syntax to a dedicated parser.

Let me know what y'all think. I would say most of the features are already implemented across the board, but they need to be organized.

Something I started experimenting with was to borrow the GLM framework for fitting the model and use it as a wrapper for the model matrix, model response, weights, etc. That way, the package would have access to GLM estimators, easier for CovarianceEstimators to be ported, etc.

Nosferican avatar Dec 12 '17 17:12 Nosferican

Can we keep this discussion focused on the covariance issue? There was a previous discussion about tests at https://github.com/JuliaStats/HypothesisTests.jl/issues/121.

@Nosferican Can you describe changes that would be needed in StatsBase/StatsModels to suit your needs? Then others could comment on whether it works for them. Apart from the functions you described above, maybe we'd need LinearRegressionModel <: RegressionModel?

nalimilan avatar Dec 13 '17 09:12 nalimilan

See Microeconometrics. It requires models to define score and jacobian. CovarianceMatrices already works for DataFramesRegressionModel which exploits the various weights (i.e, wrkwts) for computing the covariance estimates for GLM.

The alternative I am exploring right now is having the Regression package use GeneralizedLinearModel to compute the wrkwts, identify the model type by d and l, etc. If this approach works well, then models could implement the accessors to the model <:RegressionModel directly. Having GeneralizedLinearModel <: RegressionModel is not a complete solution, since the variance-covariance estimators will still need to access more elements than just those in GeneralizedLinearModel.

Estimators that can be computed as transformations of the linear predictor and the response can still use the framework by first computing the transformations and then passing these to the GeneralizedLinearModel constructor. Instrumental estimators can extract the coef from the model and use it with the appropriate model matrix to get the correct residuals. Currently there are some estimators, such as ridge regression, which do not fit in the fit!(::GeneralizedLinearModel) framework. It would be helpful to expand the framework to handle them. I don't know if GLM currently supports all kinds of weights, but Microeconometrics.jl has put some work into handling those, which could be ported to GLM.

As for support in other packages, there will probably be a CorrelationStructure package to handle computing distances for special dimensions (temporal: period, Date, DateTime; and spatial: coordinate system, datum, distance metric). Another package will probably provide and select the kernels based on the distance and data (unless one is specified). The covariance matrices package will then take all those inputs and return the estimates, which can be incorporated into the model, and packages (including GLM) will be able to store or display them.

In summary, if we go with score and jacobian, then those should be established in StatsBase. If we decide to provide it based on a GLM framework, the work would be in GLM: getting the weights working and making the fitting process more flexible so packages can use it more freely.

Support for linear models can already be achieved using the current hierarchy, inheriting from RegressionModel and using the prototype.

Nosferican avatar Dec 13 '17 09:12 Nosferican

OK. I guess it's up to you, @dmbates, @matthieugomez and @lbittarello whether you want to base your packages on GLM or on something more generic living in StatsBase/StatsModels.

nalimilan avatar Dec 13 '17 09:12 nalimilan

One advantage is that we can put more effort into making GLM as good as it can be. For example, checking all possible failure cases for logistic regression can be handled either in the utilities package or even in GLM (e.g., handling linearly dependent variables). Either way, the dependency on GLM is soft: a package wouldn't actually need to use GLM (just behave like GLM). Specifying wrkresidwts(::RegressionModel) would work regardless of whether GLM was ever used. A hybrid could work just as well, for example computing the score and jacobian based on the GLM values. Suggestions are welcome.

Nosferican avatar Dec 13 '17 09:12 Nosferican

That is sort of what I do in MixedModels, where a generalized linear mixed model (GLMM) is implemented in that package but using the GLM.GlmResp type. Truth be told, being able to do this is the reason that GlmResp is defined separately from GeneralizedLinearModel.

dmbates avatar Dec 14 '17 21:12 dmbates

Here's the rationale for score and jacobian:

The asymptotic variance of GMM estimators takes the form E (ψ Ω ψ'), where ψ is the influence function and Ω is the correlation structure. Moreover, ψ = (E J)⁻¹ s, where J is the Jacobian of the score vector (a.k.a. the Hessian matrix) and s is the score vector (i.e., the moment conditions). If we have the score and the Jacobian, we can compute all sorts of variances (White, clustered, etc.) by varying Ω.

GMM subsumes almost all parametric estimators in econometrics (as well as many semiparametric ones). (I'm not familiar with biostatistics and other fields.) For example, MLE and GLM are special cases. We can also interpret two-stage estimators (e.g., IPW) as special cases if both stages are parametric.

Therefore, basing the variance on score and jacobian gives us a lot of flexibility. We'd be able to cover models which don't fit into the GLM framework. And it should be easy to adapt GLM: CovarianceMatrices already computes the score (in its meat function) and the Jacobian (in its bread function) of LinPredModel with minimal effort.
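A hedged sketch of that score/Jacobian route, with OLS as the GMM special case (all names are illustrative; Ω is taken as E[s s'] under independence, i.e. White's choice):

```julia
# Sandwich variance from a score matrix and Jacobian, as described above.
# For OLS, sᵢ = xᵢuᵢ and J = -X'X/n, which reproduces the HC0 estimator.
using LinearAlgebra

function gmm_sandwich(s::AbstractMatrix, J::AbstractMatrix)
    n = size(s, 1)
    Ω = (s' * s) / n                 # E[s s'] under independence
    Jinv = inv(J)
    return (Jinv * Ω * Jinv') / n    # Var(β̂) ≈ (E J)⁻¹ Ω (E J)⁻¹' / n
end

# OLS special case
X = [ones(5) collect(1.0:5.0)]
y = [1.0, 2.1, 2.9, 4.2, 4.8]
β = X \ y
u = y - X * β
s = X .* u                           # score rows sᵢ = xᵢ * uᵢ
J = -(X' * X) / size(X, 1)           # Jacobian of the score
V = gmm_sandwich(s, J)
```

Varying Ω (e.g. summing scores within clusters first) then yields the clustered and HAC variants without touching the fitting code.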

lbittarello avatar Dec 15 '17 00:12 lbittarello

Is there any update on this?

IljaK91 avatar Jun 14 '18 09:06 IljaK91

Which extension in particular are you interested in? I am in the process of extending the API and suggestions are very useful. What is your use case?


gragusa avatar Jun 14 '18 10:06 gragusa

I don't need anything fancy, but to make my workflow work in Julia I need the following:

  1. Run a regression

Then either:

  2a. Specify already during the regression what kind of standard errors I need (for example HAC- or HC-robust SEs), or
  2b. Be able to specify the standard errors I need ex post and save them, either to the object directly exported by GLM or to another vector

Most importantly, then:

  3. Be able to automatically export a regression table to LaTeX with, e.g., the HAC-robust standard errors/p-values/stars.

For now I do 1 -> 2b -> 3 in R.

IljaK91 avatar Jun 14 '18 11:06 IljaK91

For vcov, see the new additions to StatsBase (information_matrix and score). GLM is pretty basic in terms of models, so it would be best to just provide the accessors and have the general vcov dependency. For exporting to LaTeX, I suspect you want the CoefTable. That could probably be implemented in StatsBase.

Nosferican avatar Jun 14 '18 11:06 Nosferican

See also https://github.com/jmboehm/RegressionTables.jl.

nalimilan avatar Jun 14 '18 12:06 nalimilan

@nalimilan Essentially there needs to be a way to combine the outputs of GLM.jl, CovarianceMatrices.jl, and RegressionTables.jl. I don't really know whether the solution is to

  1. integrate the functionality of CovarianceMatrices.jl in GLM.jl or
  2. allow CovarianceMatrices.jl to manipulate the objects that GLM.jl produces or
  3. allow RegressionTables.jl to display custom standard errors, p-values etc.

I hope I made myself sufficiently clear. For illustration I post the R code that I am using:

# Returns regressions
regression_model   <- lm(Y ~ X1 + X2, data = data)
regression_model_2 <- lm(Y ~ X1 + X3, data = data)

notes = c("\\parbox[t]{7cm}{I am using custom standard errors}")

se_new   = sqrt(diag(vcovHC(regression_model)))
se_new_2 = sqrt(diag(vcovHC(regression_model_2)))

stargazer(regression_model, regression_model_2,
          out = paste("RegResults/Returns_growth", spec, ".tex", sep = ""), title = "Regression",
          se = list(se_new, se_new_2), style = "qje",
          covariate.labels = c("$\\Delta z_{0.75,t}$", "$\\Delta \\sigma^{2}_{z,t}$"),
          dep.var.caption = "", dep.var.labels = "Returns", float = FALSE,
          df = FALSE, notes.append = FALSE, omit.stat = c("F", "ser"),
          notes = notes, notes.label = "", notes.align = "l")

I don't think that this is possible at the moment using the above-mentioned packages, but this is a very standard setup for me.

IljaK91 avatar Jun 14 '18 13:06 IljaK91

The Julia way is to avoid type-piracy. StatsBase provides the RegressionModel <: StatisticalModel and coeftable, CoefTable. For vcov, CovarianceMatrices should use the StatsBase API such that any RegressionModel that implements the API can make use of it. The user interacts with whatever package and the package calls/depends on CovarianceMatrices internally. RegressionTables can then implement the methods for StatsBase.CoefTable. Any changes needed to CoefTable for implementing those methods should be addressed as an issue in StatsBase.


The issue with the R implementation is that vcovHC, for instance, often needs to be aware of things glm does not keep track of in order to work properly (e.g., with fixed effects it would use the wrong dof_residual). For CRVE estimates one needs access to the cluster dimensions, which are not available when recomputing the vce on the fly with a different HC. Similarly, for HAC, it is important to keep track of steps and gaps, which glm doesn't understand. This leads to R's approach needing some manual input, which can be annoying and error-prone.

Nosferican avatar Jun 14 '18 13:06 Nosferican