Taking weighting seriously
This PR addresses several problems with the current GLM implementation.
Current status
In master, GLM/LM only accepts weights through the keyword argument `wts`. These weights are implicitly frequency weights.
With this PR, `FrequencyWeights`, `AnalyticWeights`, and `ProbabilityWeights` are possible. The API is the following:

```julia
## Frequency Weights
lm(@formula(y ~ x), df; wts=fweights(df.wts))
## Analytic Weights
lm(@formula(y ~ x), df; wts=aweights(df.wts))
## Probability Weights
lm(@formula(y ~ x), df; wts=pweights(df.wts))
```
The old behavior, passing a plain vector via `wts=df.wts`, is deprecated; for the moment, the array `df.wts` is coerced to `FrequencyWeights`.
To allow dispatching on the weights, `CholPred` takes a type parameter `T<:AbstractWeights`. The unweighted LM/GLM has `UnitWeights` as the parameter for the type.
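As an illustration of the dispatch this enables, here is a minimal sketch (with a hypothetical `MyCholPred` type and `varstyle` function, not the PR's actual code):

```julia
using StatsBase  # AbstractWeights, ProbabilityWeights, pweights, uweights

# Hypothetical stand-in for the weight-parameterized predictor type.
struct MyCholPred{T<:AbstractWeights}
    wts::T
end

# Probability weights call for a sandwich variance; others use the classical one.
varstyle(::MyCholPred{<:ProbabilityWeights}) = :sandwich
varstyle(::MyCholPred{<:AbstractWeights})    = :classical

varstyle(MyCholPred(pweights([1.0, 2.0, 3.0])))  # :sandwich
varstyle(MyCholPred(uweights(3)))                # :classical (UnitWeights)
```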
This PR also implements `residuals(r::RegressionModel; weighted::Bool=false)` and `modelmatrix(r::RegressionModel; weighted::Bool=false)`. The new signature for these two methods is pending in StatsAPI.
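For example, a sketch of the proposed usage (assuming, per the `modelmatrix` implementation quoted later in the thread, that `weighted=true` applies a √wᵢ scaling):

```julia
using DataFrames, GLM, StatsBase

df = DataFrame(y = randn(10), x = randn(10), w = rand(10) .+ 0.5)
m = lm(@formula(y ~ x), df; wts = aweights(df.w))

residuals(m)                     # unweighted residuals (default)
residuals(m; weighted = true)    # residuals on the √wᵢ-scaled problem
modelmatrix(m; weighted = true)  # √wᵢ-scaled model matrix
```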
There are many changes I had to make to get everything working. Tests are passing, but some of the new features need new tests. Before implementing them, I wanted to make sure the approach taken is acceptable.
I have also implemented `momentmatrix`, which returns the estimating function of the estimator. I arrived at the conclusion that a `weighted` keyword argument does not make sense here, so I will amend https://github.com/JuliaStats/StatsAPI.jl/pull/16 to remove that keyword from the signature.
Update
I think I covered all the suggestions/comments, with one exception that I still have to think about; maybe it can be addressed later. The new standard errors (the ones for `ProbabilityWeights`) also work in the rank-deficient case (and so does `cooksdistance`).
Tests are passing, and I think they cover everything that I have implemented. I also added a section to the documentation about using weights, and updated the jldoctests with the new signature of `CholeskyPivoted`.
To do:
- [X] Deal with weighted standard errors with rank deficient designs
- [X] Document the new API
- [X] Improve testing
Closes https://github.com/JuliaStats/GLM.jl/issues/186, https://github.com/JuliaStats/GLM.jl/issues/259.
Codecov Report
Patch coverage: 84.08% and project coverage change: -2.67% :warning:
Comparison is base (c13577e) 90.48% compared to head (fa63a9a) 87.82%.
:exclamation: Current head fa63a9a differs from pull request most recent head 807731a. Consider uploading reports for the commit 807731a to get more accurate results.
Additional details and impacted files
```diff
@@            Coverage Diff             @@
##           master     #487      +/-   ##
==========================================
- Coverage   90.48%   87.82%    -2.67%
==========================================
  Files           8        8
  Lines        1125     1191      +66
==========================================
+ Hits         1018     1046      +28
- Misses        107      145      +38
```
| Impacted Files | Coverage Δ | |
|---|---|---|
| src/GLM.jl | 50.00% <ø> (-10.00%) | :arrow_down: |
| src/glmfit.jl | 81.41% <79.80%> (-0.51%) | :arrow_down: |
| src/lm.jl | 89.06% <82.35%> (-5.35%) | :arrow_down: |
| src/glmtools.jl | 93.49% <85.71%> (-0.96%) | :arrow_down: |
| src/linpred.jl | 91.41% <90.32%> (-6.92%) | :arrow_down: |
Hey,
Would that fix the issue I am having, which is that if rows of the data contain missing values, GLM discards those rows, but does not discard the corresponding values of `df.weights`, and then yells that there are too many weights?
I think the interface should allow for a DataFrame input of weights, which would take care of such things (as it does for the other variables).
> Would that fix the issue I am having, which is that if rows of the data contain missing values, GLM discards those rows, but does not discard the corresponding values of `df.weights`, and then yells that there are too many weights?
Not really. But it would be easy to make this a feature. Before digging further into this, though, I would like to know whether there is consensus on the approach of this PR.
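Until then, a user-side workaround is to drop the incomplete rows before building the weight vector (a sketch; column names `y`, `x`, and `weights` are assumptions):

```julia
using DataFrames, GLM, StatsBase

df2 = dropmissing(df, [:y, :x])  # keeps df2.weights aligned with the fitted rows
m = lm(@formula(y ~ x), df2; wts = fweights(df2.weights))
```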
FYI this appears to fix #420; a PR was started in #432 and the author closed it for lack of time to investigate CI failures.
Here's the test case pulled from #432, which passes with the changes in #487.
```julia
using GLM, DataFrames, LinearAlgebra, StableRNGs, Test

@testset "collinearity and weights" begin
    rng = StableRNG(1234321)
    x1 = randn(rng, 100)
    x1_2 = 3 * x1
    x2 = 10 * randn(rng, 100)
    x2_2 = -2.4 * x2
    y = 1 .+ randn(rng) * x1 + randn(rng) * x2 + 2 * randn(rng, 100)
    df = DataFrame(y = y, x1 = x1, x2 = x1_2, x3 = x2, x4 = x2_2,
                   weights = repeat([1, 0.5], 50))
    f = @formula(y ~ x1 + x2 + x3 + x4)
    lm_model = lm(f, df, wts = df.weights)  # dropcollinear = true (the default)
    X = [ones(length(y)) x1_2 x2_2]
    W = Diagonal(df.weights)
    coef_naive = (X'W*X) \ X'W*y
    @test lm_model.model.pp.chol isa CholeskyPivoted
    @test rank(lm_model.model.pp.chol) == 3
    @test isapprox(filter(!=(0.0), coef(lm_model)), coef_naive)
end
```
Can this test set be added?
Is there any other feedback for @gragusa? It would be great to get this merged if it's good to go.
Sorry for the long delay, I hadn't realized you were waiting for feedback. Looks great overall, please feel free to finish it! I'll try to find the time to make more specific comments.
A very nice PR. In the tests, can we have a test set that compares the results of `aweights`, `fweights`, and `pweights` for the same data (coefficients, predictions, covariance matrix of the estimates, p-values, etc.)?
@nalimilan thanks for the careful review - I will make the suggestions part of the PR.
As for the testing, as suggested by @bkamins, we can probably use the same data to test everything with different weights. I will try to do this.
As for the QR: I want to add the pivoted QR; however, I am not sure how to do this. It is easy to write a rank-revealing pivoted QR decomposition, but I am not sure how to use it to obtain the solution for the linearly independent columns of the system (rough sketch below).
Related to pivoting: something we should definitely add is the possibility for `crossmodelmatrix` and `modelmatrix` to return the linearly independent columns together with their indexes in the original full matrix. This would fix `cooksdistance` (which only works for full-rank designs) and also make `momentmatrix` useful in the rank-deficient case. Are there any thoughts about this?
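For concreteness, here is a rough, untested sketch of what a pivoted-QR solve for the linearly independent columns could look like (`qr_coef` is a hypothetical helper, not code from this PR):

```julia
using LinearAlgebra

function qr_coef(X::AbstractMatrix, y::AbstractVector; rtol = sqrt(eps(eltype(X))))
    F = qr(X, ColumnNorm())                            # rank-revealing pivoted QR
    r = sum(abs.(diag(F.R)) .> rtol * abs(F.R[1, 1]))  # numerical rank
    β = zeros(eltype(X), size(X, 2))
    # solve on the r independent (pivoted) columns, scatter back via the pivot
    β[F.p[1:r]] = UpperTriangular(F.R[1:r, 1:r]) \ (F.Q' * y)[1:r]
    return β
end
```

Dependent columns get a coefficient of zero here, mirroring the pivoted Cholesky behavior exercised by the `filter(!=(0.0), coef(lm_model))` check in the test above.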
@nalimilan, or somebody with GitHub fu: I accidentally synced my fork gragusa/GLM.jl with JuliaStats/GLM:master. I have the commits in a branch gragusa/GLM.jl/JuliaStats-master, but if I push changes to this branch, the PR is not updated. Any help?
Are you sure? I see the same commits here and on the branch (Throw error if GlmResp are not AbastractWeights).
> As for the QR: I want to add the pivoted QR; however, I am not sure how to do this. It is easy to write a rank-revealing pivoted QR decomposition, but I am not sure how to use it to obtain the solution for the linearly independent columns of the system.
I haven't investigated this at all, but shouldn't this be similar to how it works for Cholesky decomposition?
> Related to pivoting: something we should definitely add is the possibility for `crossmodelmatrix` and `modelmatrix` to return the linearly independent columns together with their indexes in the original full matrix. This would fix `cooksdistance` (which only works for full-rank designs) and also make `momentmatrix` useful in the rank-deficient case. Are there any thoughts about this?
That's tricky. Let's discuss this in a separate issue as we have enough on our plate here. ;-)
> Are you sure? I see the same commits here and on the branch (Throw error if GlmResp are not AbastractWeights).
Yes. I can see the commit as well!
> That's tricky. Let's discuss this in a separate issue as we have enough on our plate here. ;-)
Ok. I will work on it, once I tame this monster here!
I updated the documentation, but while it builds correctly locally, I see failures in the checks here.
The reason is that the type signatures for GLM and LM differ from the ones I am getting locally. Locally I am using Julia 1.7; the docs GitHub action probably uses Julia 1.8. Is it possible that type signatures are different between versions? I will try later to build locally with 1.8 and update the docs in case that's the reason.
> Is it possible that type signatures are different between versions?
Yes, it is possible. Output changes are not considered breaking between Julia versions (but I have not checked whether this is the case here; another common cause of such problems is having a different set of packages loaded locally than on CI - then the prefixes of types might change).
No, it is not the prefix: the `CholPred` signature output is different on CI than locally, and by the way, the output seems to be wrong and is messing with the curly brackets. But I am checking this carefully.
Now I have looked at the output - it seems that the order of 2 type parameters is reversed.
> Now I have looked at the output - it seems that the order of 2 type parameters is reversed.
What happened is that `CholeskyPivoted` changed its signature between versions:
- 1.7: `CholeskyPivoted{Float64, Matrix{Float64}}`
- 1.8: `CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}`
@nalimilan what do you think about this?
@nalimilan I committed your suggested changes. I see you bumped several review comments that I had previously addressed (or answered). I am going to report these here:
- https://github.com/JuliaStats/GLM.jl/pull/487#discussion_r985297147 - Why did `dof_residual` change type? Because `UnitWeights` has `eltype` `Int` (previously, no weighting meant an empty float array).
- https://github.com/JuliaStats/GLM.jl/pull/487#discussion_r985297465 - What is the question you are referring to?
- https://github.com/JuliaStats/GLM.jl/pull/487#discussion_r985297566 - and others, I think.
As for the general PR: I am very, very puzzled by `loglikelihood`, `deviance`, and `nulldeviance`. I tried R and Stata, and with weights they all give different, conflicting results. I have to think about this... I cannot look at the code, since R is GPL and Stata is proprietary.
Ah sorry, the GitHub UI doesn't show the threads in which I added the comments. These are:
- https://github.com/JuliaStats/GLM.jl/pull/487/files/dbc9ae999d0dd43af023e5b182014fdef7afa41f#r958885835
- https://github.com/JuliaStats/GLM.jl/pull/487/files/dbc9ae999d0dd43af023e5b182014fdef7afa41f#r978083032
- https://github.com/JuliaStats/GLM.jl/pull/487/files/dbc9ae999d0dd43af023e5b182014fdef7afa41f#r978061661
- https://github.com/JuliaStats/GLM.jl/pull/487/files/dbc9ae999d0dd43af023e5b182014fdef7afa41f#r978063134
> Why did `dof_residual` change type? Because `UnitWeights` has `eltype` `Int` (previously, no weighting meant an empty float array).
Ah, I see. I think I originally defined `nobs` to always return a float to make the function type-stable, as the presence/absence of weights was not reflected in the model type. Now this is OK thanks to the weight vector types.
> As for the general PR: I am very, very puzzled by `loglikelihood`, `deviance`, and `nulldeviance`. I tried R and Stata, and with weights they all give different, conflicting results. I have to think about this... I cannot look at the code, since R is GPL and Stata is proprietary.
Can you point me to an example of a problematic model? I can have a look, including at the R code, since I'm not the one writing the implementation.
https://github.com/JuliaStats/GLM.jl/pull/487/files/dbc9ae999d0dd43af023e5b182014fdef7afa41f#r978083032
Isn't this equivalent?
Suggested change:

```diff
-X = modelmatrix(obj; weighted=isweighted(obj))
-if k == size(X,2)
-    XtX = crossmodelmatrix(obj; weighted=isweighted(obj))
+X = modelmatrix(obj; weighted=true)
+if k == size(X, 2)
+    XtX = crossmodelmatrix(obj; weighted=true)
```
`modelmatrix` is implemented as

```julia
function modelmatrix(obj::LinPredModel; weighted::Bool=false)
    if isweighted(obj)
        mul!(obj.pp.scratchm1, Diagonal(sqrt.(obj.pp.wts)), obj.pp.X)
    else
        obj.pp.X
    end
end
```
so the keyword argument does not do much... Should I define it as

```julia
function modelmatrix(obj::LinPredModel; weighted::Bool=isweighted(obj))
    if weighted
        mul!(obj.pp.scratchm1, Diagonal(sqrt.(obj.pp.wts)), obj.pp.X)
    else
        obj.pp.X
    end
end
```
After this change they will be equivalent... However, for the unweighted model, it is better to specify `weighted=false` to avoid a matrix multiplication.
https://github.com/JuliaStats/GLM.jl/pull/487/files/dbc9ae999d0dd43af023e5b182014fdef7afa41f#r978083032
For the standard case (homoskedastic errors), `invchol` returns a (p × p) matrix with `NaN` for coefficients corresponding to the collinear rows. The formula is $\hat{\sigma}^2 (X'X)^{-1}$, and we get `NaN` for the variance/standard errors of these coefficients. For the weighted case, the formula for the variance is $(X'X)^{-1} \left( \sum_{i=1}^n X_i' u_i u_i' X_i \right) (X'X)^{-1}$; without the machinery, `NaN` would propagate through the matrix products, resulting in a `NaN` matrix. The machinery removes the `NaN` elements, computes the matrix products, and then expands the matrix back to its original size. I hope it makes sense.
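In code, the machinery amounts to something like this (a sketch with hypothetical names, not the PR's actual implementation):

```julia
using LinearAlgebra

# invXtX: (p × p) matrix with NaN rows/columns for collinear coefficients
# meat:   (p × p) middle matrix Σᵢ Xᵢ' uᵢ uᵢ' Xᵢ
function sandwich_expand(invXtX::AbstractMatrix, meat::AbstractMatrix)
    keep = findall(!isnan, diag(invXtX))  # non-collinear coefficients
    A = invXtX[keep, keep]
    V = A * meat[keep, keep] * A          # sandwich on the reduced problem
    out = fill(eltype(V)(NaN), size(invXtX))
    out[keep, keep] = V                   # expand back to the original size
    return out
end
```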
https://github.com/JuliaStats/GLM.jl/pull/487/files/dbc9ae999d0dd43af023e5b182014fdef7afa41f#r978061661 ...about performance for the unweighted case.
Committed a change
https://github.com/JuliaStats/GLM.jl/pull/487/files/dbc9ae999d0dd43af023e5b182014fdef7afa41f#r958885835
Will take care of this... it needs testing.
Likelihood issue
The likelihood (or quasi-likelihood) used for fitting in all three cases is the same - that's why we get the same point estimates.
The problem is that we can sometimes define post-estimation likelihoods that (somewhat ad hoc) try to take into account the nature of the weights. Being ad hoc, they are arbitrary. See this document I made here (html). You can see how Stata's numbers do not match those of R or those of this PR.
Personally, I would define the likelihood in the same way for all weights - it's more transparent. If we want to define different likelihoods, then we have to carefully document what they are (at the moment it is not clear what they are in Stata; I know what they are in R).
> After this change they will be equivalent... However, for the unweighted model, it is better to specify `weighted=false` to avoid a matrix multiplication.
I'd use `if weighted && isweighted(x)` to avoid the unnecessary multiplication.
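Putting both pieces together, the definition would read (sketch):

```julia
function modelmatrix(obj::LinPredModel; weighted::Bool=isweighted(obj))
    if weighted && isweighted(obj)
        mul!(obj.pp.scratchm1, Diagonal(sqrt.(obj.pp.wts)), obj.pp.X)
    else
        obj.pp.X
    end
end
```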
> For the standard case (homoskedastic errors), `invchol` returns a (p × p) matrix with `NaN` for coefficients corresponding to the collinear rows. The formula is...
OK, thanks for the details.
> The likelihood (or quasi-likelihood) used for fitting in all three cases is the same - that's why we get the same point estimates.
>
> The problem is that we can sometimes define post-estimation likelihoods that (somewhat ad hoc) try to take into account the nature of the weights. Being ad hoc, they are arbitrary. See this document I made here (html). You can see how Stata's numbers do not match those of R or those of this PR.
>
> Personally, I would define the likelihood in the same way for all weights - it's more transparent. If we want to define different likelihoods, then we have to carefully document what they are (at the moment it is not clear what they are in Stata; I know what they are in R).
AFAICT we can't use the same likelihood definition for all weights types, as it needs to be scale-free for probability weights, right? One criterion I can see for choosing how to define the likelihood is that the McFadden pseudo-R² for the normal family with identity link should be equal to the R² of the equivalent linear model. Does that make sense?
> AFAICT we can't use the same likelihood definition for all weights types, as it needs to be scale-free for probability weights, right? One criterion I can see for choosing how to define the likelihood is that the McFadden pseudo-R² for the normal family with identity link should be equal to the R² of the equivalent linear model. Does that make sense?
They all use the same likelihood up to a normalizing constant (which, as you correctly pointed out, depends on the scale of the weights). Also, the deviance is the same regardless of the weight type, which is what we have (consistent with R). The deviance is twice the difference between the log-likelihood and the null log-likelihood. Thus, the differences between these likelihoods are due to constant scaling, which does not matter when using the likelihood for comparisons.
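In symbols, the claim is roughly (writing the weight-dependent normalization as a constant $C(w)$ that does not depend on $\theta$):

$$\ell_w(\theta) = \sum_{i=1}^n w_i\,\ell_i(\theta) + C(w), \qquad D = 2\left[\ell_w(\hat\theta) - \ell_w(\hat\theta_0)\right],$$

so $C(w)$ cancels in the deviance and in any likelihood-ratio-style comparison.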
For instance, the R survey package seems to calculate the likelihood as (minus) half the ~~likelihood~~ deviance (with a deviance that is equal to GLM.jl's and to R's with analytic weights).
> For instance, the R survey package seems to calculate the likelihood as (minus) half the likelihood (with a deviance that is equal to GLM.jl's and to R's with analytic weights).
You mean "half the deviance", right? Indeed, I just checked that it's the case with `svyglm`.
But yeah, now I get that the constant factor goes away when computing the ratio of likelihoods, so requiring the McFadden pseudo-R² to equal the R² isn't enough to choose a definition.
However, the definition used by `svyglm` doesn't seem to work well with other pseudo-R²s: for example, the Cox-Snell pseudo-R² `1 - exp(2 * (logLik(nullmodel) - logLik(model)) / nobs(model))` always seems to return 1 in my tests. This means that fitting a model with probability weights all equal to one doesn't give the same Cox-Snell pseudo-R² as fitting an unweighted model. That's problematic, right?
BTW, the R survey package is intended mainly for complex designs (strata, etc.), which means that most standard statistics don't work anyway. This may have led the package to diverge from the standard functions (e.g. `logLik` prints a warning). Since our scope is more limited here (no complex sampling), we might better use different definitions. Not sure.
> However, the definition used by `svyglm` doesn't seem to work well with other pseudo-R²s: for example, the Cox-Snell pseudo-R² `1 - exp(2 * (logLik(nullmodel) - logLik(model)) / nobs(model))` always seems to return 1 in my tests. This means that fitting a model with probability weights all equal to one doesn't give the same Cox-Snell pseudo-R² as fitting an unweighted model. That's problematic, right?
If the deviance is given by

$$D = -2\,[\mathrm{logLik}(\mathrm{model}) - \mathrm{logLik}(\mathrm{nullmodel})],$$

then

$$\text{pseudo-}R^2 = 1 - \exp(D/\mathrm{nobs}(\mathrm{model})),$$

which is again equal regardless of the weighting type. I believe the differences between software packages are due to different implementations and choices of the normalization constant. For instance, Stata for aweights reports the deviance using rescaled weights $\bar{w}_i = w_i / \left(\frac{1}{n}\sum_{i=1}^n w_i\right)$.
Note that the relation $D = -2\,[\mathrm{logLik}(\mathrm{model}) - \mathrm{logLik}(\mathrm{nullmodel})]$ does not actually hold even for unweighted models (see https://github.com/JuliaStats/GLM.jl/pull/115#discussion_r45355268). And the pseudo-R² is defined in terms of the log-likelihood, not in terms of the deviance.
In R, with `svyglm`, the deviance is the same when fitting an unweighted model and when fitting a model with all weights equal. But the log-likelihood is completely different, since R's `glm` (and GLM.jl) doesn't define the log-likelihood as $-D/2$. So I don't think we can adopt the definition from `svyglm`.
Don't you agree that the log-likelihood should be the same when fitting a model without weights and when fitting a model with probability weights being all equal? That seems to imply we should choose the right normalization of weights to ensure that property. A simple solution would be to normalize them to sum to the number of observations. Then we could use the same formula to compute the likelihood as for frequency weights.
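Concretely, the normalization could be as simple as (a sketch; `normalize_weights` is a hypothetical helper, not part of this PR):

```julia
# Rescale weights to sum to the number of observations, so that
# all-equal probability weights reduce to the unweighted case.
normalize_weights(w::AbstractVector) = w .* (length(w) / sum(w))
```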
For analytic weights the situation is more tricky as the definition isn't completely clear in Stata nor in StatsBase (see https://github.com/JuliaStats/StatsBase.jl/issues/758). But it seems that at least when all weights are equal to one we should get the same results as an unweighted model. Cc: @ParadaCarleton
> For analytic weights the situation is more tricky as the definition isn't completely clear in Stata nor in StatsBase (see JuliaStats/StatsBase.jl#758). But it seems that at least when all weights are equal to one we should get the same results as an unweighted model. Cc: @ParadaCarleton
I outlined a response there; sorry, I forgot about it. Similar comments apply to GLM.jl: I don't think it's necessarily possible to create a single method for `glm` that deals with analytic weights, unless lots and lots of keywords are added to supply additional information. The problem with analytic weights is that a lot of assumptions go into a good meta-analysis: questions like whether the means and variances can be considered homogeneous, whether there's possibly publication bias, and which estimators are best (as the choice is highly nontrivial when the data has been pre-aggregated).
At some point, someone will probably have to create a Julia package for meta-analysis, but that will require lots of hard work and research by a professional statistician, probably more than we can fit into a single pull request for GLM.jl.
Thanks for commenting. Though in other software (R, Stata...) it's standard to use weighted least squares estimation with analytic weights. For example, in R, `?lm` says:
> Non-‘NULL’ ‘weights’ can be used to indicate that different observations have different variances (with the values in ‘weights’ being inversely proportional to the variances); or equivalently, when the elements of ‘weights’ are positive integers w_i, that each response y_i is the mean of w_i unit-weight observations (including the case that there are w_i observations equal to y_i and the data have been summarized). However, in the latter case, notice that within-group variation is not used. Therefore, the sigma estimate and residual degrees of freedom may be suboptimal; in the case of replication weights, even wrong. Hence, standard errors and analysis of variance tables should be treated with care.
Do you think we should provide a way to get the same estimates as R, or is that definition too broken to be useful? If so, with which weights type?
If that's too tricky, we could support only UnitWeights, FrequencyWeights, and ProbabilityWeights for now, as the PR is already large.
The issue in StatsBase.jl and this PR are pretty much unrelated. No matter what weights are used, the estimated coefficients (the point estimates) will be the same. They all maximize $$\sum_{i=1}^n w_i\,\ell_i(\theta),$$ where $\ell_i(\theta)$ is the log-likelihood contribution of observation $i$ under the model one would maximize if weights were not provided.
With `ProbabilityWeights`, the variance is different. There is also the issue of which log-likelihood to report. Again, under `ProbabilityWeights` the log-likelihood being maximized is a quasi-likelihood, and different software reports different log-likelihoods.
~~AnalyticWeights~~ ImportanceWeights are the most commonly used because they have less structure and are helpful in cases where the weights are not related to sampling. For instance, in certain applications in labor economics, one weights the observations by workers' hours worked. Or, in propensity-score-based causal inference, the weights are the probability of being treated.
So we can keep splitting hairs about the weights and their definition, but I think it would only add complexity. R admits only ~~analytic~~ importance weights for `lm` and `glm`. Other, more specialized packages then use `glm` and `lm` with ~~analytic~~ importance weights: the survey package uses `glm` underneath for `svyglm` with ~~analytic~~ importance weights.