
Taking weighting seriously

Open gragusa opened this issue 3 years ago • 117 comments

This PR addresses several problems with the current GLM implementation.

Current status: in master, GLM/LM only accept weights through the keyword wts. These weights are implicitly frequency weights.

With this PR, FrequencyWeights, AnalyticWeights, and ProbabilityWeights are all supported. The API is the following:

## Frequency Weights
lm(@formula(y~x), df; wts=fweights(df.wts))
## Analytic Weights
lm(@formula(y~x), df; wts=aweights(df.wts))
## Probability Weights
lm(@formula(y~x), df; wts=pweights(df.wts))

The old behavior -- passing a plain vector via wts=df.wts -- is deprecated; for the moment, the array is coerced to FrequencyWeights.

To allow dispatching on the weights, CholPred takes a type parameter T<:AbstractWeights. Unweighted LM/GLM models use UnitWeights as this parameter.
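As a sketch of why the type parameter matters (DemoPred and vcov_kind are hypothetical names for illustration, not the PR's actual types):

```julia
using StatsBase  # provides the weight types and the fweights/aweights/pweights constructors

# Parameterizing a predictor struct by its weight type lets methods
# dispatch on how the model was weighted.
struct DemoPred{W<:AbstractWeights}
    wts::W
end

vcov_kind(::DemoPred{<:FrequencyWeights})   = "classical"
vcov_kind(::DemoPred{<:AnalyticWeights})    = "classical, rescaled"
vcov_kind(::DemoPred{<:ProbabilityWeights}) = "sandwich (robust)"
vcov_kind(::DemoPred{<:UnitWeights})        = "classical, unweighted"

vcov_kind(DemoPred(pweights([1.0, 0.5])))   # "sandwich (robust)"
```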

This PR also implements residuals(r::RegressionModel; weighted::Bool=false) and modelmatrix(r::RegressionModel; weighted::Bool = false). The new signature for these two methods is pending in StatsApi.

There are many changes I had to make to get everything working. Tests are passing, but some of the new features need new tests. Before implementing them, I wanted to make sure there was agreement on the approach taken.

I have also implemented momentmatrix, which returns the estimating function of the estimator. I arrived at the conclusion that it does not make sense to have a weighted keyword argument there. Thus I will amend https://github.com/JuliaStats/StatsAPI.jl/pull/16 to remove that keyword from the signature.
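For intuition, here is what the estimating function looks like for plain least squares (illustrative data; this is not the PR's momentmatrix implementation):

```julia
using LinearAlgebra

# For least squares, the estimating function stacks the per-observation
# scores xᵢ(yᵢ - xᵢ'β), which sum to (approximately) zero at the fit.
X = [ones(5) collect(1.0:5.0)]
y = [1.1, 1.9, 3.2, 3.9, 5.1]
β = X \ y
M = X .* (y .- X * β)   # n × p matrix of estimating functions
sum(M, dims = 1)        # ≈ [0.0 0.0]: the first-order conditions hold
```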

Update

I think I have covered all the suggestions/comments, with one exception that I still have to think about; maybe it can be addressed later. The new standard errors (the ones for ProbabilityWeights) also work in the rank-deficient case (and so does cooksdistance).

Tests are passing and I think they cover everything I have implemented. I also added a section to the documentation about using weights and updated the jldoctests with the new signature of CholeskyPivoted.

To do:

  • [X] Deal with weighted standard errors with rank deficient designs
  • [X] Document the new API
  • [X] Improve testing

Closes https://github.com/JuliaStats/GLM.jl/issues/186, https://github.com/JuliaStats/GLM.jl/issues/259.

gragusa avatar Jul 15 '22 16:07 gragusa

Codecov Report

Patch coverage: 84.08% and project coverage change: -2.67 :warning:

Comparison is base (c13577e) 90.48% compared to head (fa63a9a) 87.82%.

:exclamation: Current head fa63a9a differs from pull request most recent head 807731a. Consider uploading reports for the commit 807731a to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #487      +/-   ##
==========================================
- Coverage   90.48%   87.82%   -2.67%     
==========================================
  Files           8        8              
  Lines        1125     1191      +66     
==========================================
+ Hits         1018     1046      +28     
- Misses        107      145      +38     
Impacted Files Coverage Δ
src/GLM.jl 50.00% <ø> (-10.00%) :arrow_down:
src/glmfit.jl 81.41% <79.80%> (-0.51%) :arrow_down:
src/lm.jl 89.06% <82.35%> (-5.35%) :arrow_down:
src/glmtools.jl 93.49% <85.71%> (-0.96%) :arrow_down:
src/linpred.jl 91.41% <90.32%> (-6.92%) :arrow_down:

... and 1 file with indirect coverage changes


codecov-commenter avatar Jul 16 '22 08:07 codecov-commenter

Hey,

Would that fix the issue I am having, which is that if rows of the data contain missing values, GLM discards those rows but does not discard the corresponding values of df.weights, and then complains that there are too many weights?

I think the interface should allow passing weights as a column of the DataFrame, which would take care of such things (as it does for the other variables).

lrnv avatar Jul 20 '22 07:07 lrnv

Would that fix the issue I am having, which is that if rows of the data contain missing values, GLM discards those rows but does not discard the corresponding values of df.weights, and then complains that there are too many weights?

Not really, but it would be easy to add as a feature. Before digging further into this, though, I would like to know whether there is consensus on the approach of this PR.

gragusa avatar Jul 20 '22 17:07 gragusa

FYI this appears to fix #420; a PR was started in #432 and the author closed for lack of time on their part to investigate CI failures.

Here's the test case pulled from #432, which passes with the changes in #487.

using DataFrames, GLM, LinearAlgebra, StableRNGs, Test

@testset "collinearity and weights" begin
    rng = StableRNG(1234321)
    x1 = randn(rng, 100)
    x1_2 = 3 * x1
    x2 = 10 * randn(rng, 100)
    x2_2 = -2.4 * x2
    y = 1 .+ randn(rng) * x1 + randn(rng) * x2 + 2 * randn(rng, 100)
    df = DataFrame(y = y, x1 = x1, x2 = x1_2, x3 = x2, x4 = x2_2,
                   weights = repeat([1, 0.5], 50))
    f = @formula(y ~ x1 + x2 + x3 + x4)
    lm_model = lm(f, df, wts = df.weights)
    X = [ones(length(y)) x1_2 x2_2]
    W = Diagonal(df.weights)
    coef_naive = (X'W * X) \ X'W * y
    @test lm_model.model.pp.chol isa CholeskyPivoted
    @test rank(lm_model.model.pp.chol) == 3
    @test isapprox(filter(!=(0.0), coef(lm_model)), coef_naive)
end

Can this test set be added?

Is there any other feedback for @gragusa ? It would be great to get this merged if good to go.

alecloudenback avatar Aug 14 '22 19:08 alecloudenback

Sorry for the long delay, I hadn't realized you were waiting for feedback. Looks great overall, please feel free to finish it! I'll try to find the time to make more specific comments.

nalimilan avatar Aug 28 '22 18:08 nalimilan

A very nice PR. In the tests can we have some test set that compares the results of aweights, fweights, and pweights for the same set of data (coeffs, predictions, covariance matrix of the estimates, p-values etc.).

bkamins avatar Aug 31 '22 08:08 bkamins

@nalimilan thanks for the careful review - I will make the suggestions part of the PR.

As for the testing - as suggested by @bkamins - probably we can use the same data to test everything with different weights. I will try to do this.

As for the QR: I want to add pivoted QR; however, I am not sure how to do this. It is easy to write a rank-revealing pivoted QR decomposition, but I am not sure how to use it to obtain the solution for the linearly independent columns of the system.
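For what it's worth, one possible approach with LinearAlgebra's column-pivoted QR (ColumnNorm() requires Julia >= 1.7; this is only a sketch on toy data, not a worked-out solution for LinPred):

```julia
using LinearAlgebra

X = [1.0 2.0 2.0; 1.0 3.0 3.0; 1.0 4.0 4.0; 1.0 5.0 5.0]  # column 3 duplicates column 2
y = [1.0, 2.0, 3.0, 4.0]
F = qr(X, ColumnNorm())    # rank-revealing, column-pivoted QR
r = rank(X)                # 2: one column is redundant
keep = sort(F.p[1:r])      # indices of the linearly independent columns
β = zeros(size(X, 2))      # dropped coefficients stay at zero
β[keep] = X[:, keep] \ y   # solve using the independent columns only
```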

Related to pivoting: something we should definitely add is the possibility for crossmodelmatrix and modelmatrix to return the linearly independent columns together with their indexes in the original full matrix. This would fix cooksdistance (which only works for full-rank designs) and also make momentmatrix useful in the rank-deficient case. Any thoughts about this?

gragusa avatar Sep 01 '22 16:09 gragusa

@nalimilan or somebody with GitHub fu: I accidentally synced my fork gragusa/GLM.jl with JuliaStats/GLM:master. I have the commits in a branch gragusa/GLM.jl/JuliaStats-master, but if I push changes to this branch, the PR is not updated. Any help?

gragusa avatar Sep 05 '22 17:09 gragusa

Are you sure? I see the same commits here and on the branch (Throw error if GlmResp are not AbastractWeights).

nalimilan avatar Sep 05 '22 17:09 nalimilan

As for the QR: I want to add pivoted QR; however, I am not sure how to do this. It is easy to write a rank-revealing pivoted QR decomposition, but I am not sure how to use it to obtain the solution for the linearly independent columns of the system.

I haven't investigated this at all, but shouldn't this be similar to how it works for Cholesky decomposition?

Related to pivoting: something we should definitely add is the possibility for crossmatrix e modelmatrix to return the linear independent columns together with the indexes in the original full matrix. This would fix cookdistance (which only work form full rank designs) and also make momentmatrix useful for rank deficient. Is there any thoughts about this?

That's tricky. Let's discuss this in a separate issue as we have enough on our plate here. ;-)

nalimilan avatar Sep 05 '22 17:09 nalimilan

Are you sure? I see the same commits here and on the branch (Throw error if GlmResp are not AbastractWeights).

Yes. I can see the commit as well!

gragusa avatar Sep 05 '22 17:09 gragusa

That's tricky. Let's discuss this in a separate issue as we have enough on our plate here. ;-)

Ok. I will work on it, once I tame this monster here!

gragusa avatar Sep 05 '22 17:09 gragusa

I updated the documentation, but while it is correctly built locally, I see failures on checks here.

The reason is that the type signatures for GLM and LM differ from the ones I am getting locally. Locally I am using Julia 1.7; the doc GitHub action probably uses Julia 1.8. Is it possible that type signatures are different between versions? I will try later to build locally with 1.8 and update the docs in case that's the reason.

gragusa avatar Sep 08 '22 10:09 gragusa

Is it possible that type signature are different between versions?

Yes, it is possible. Output changes are not considered breaking between Julia versions (but I have not checked if this is the case here; another common reason of such problems arises if locally you have a different set of packages loaded than on CI - then the prefixes of types might change)

bkamins avatar Sep 08 '22 11:09 bkamins

No, it is not the prefix: the CholPred signature output is different on CI than locally and, by the way, the output seems to be wrong and mangles the curly brackets. But I am checking this carefully.


gragusa avatar Sep 08 '22 11:09 gragusa

Now I have looked at the output - it seems that the order of 2 type parameters is reversed.

bkamins avatar Sep 08 '22 12:09 bkamins

Now I have looked at the output - it seems that the order of 2 type parameters is reversed.

What happened is that CholeskyPivoted changed the signature between versions

  • 1.7 CholeskyPivoted{Float64, Matrix{Float64}}

  • 1.8 CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}

gragusa avatar Sep 08 '22 12:09 gragusa

@nalimilan what do you think about this?

gragusa avatar Sep 20 '22 14:09 gragusa

@nalimilan I committed your suggested changes. I see you bumped several review comments that I had previously addressed (or answered). I am going to report these here:

  1. https://github.com/JuliaStats/GLM.jl/pull/487#discussion_r985297147 Why did dof_residual change type? Because UnitWeights has eltype Int (previously, no weighting meant an empty float array).

  2. https://github.com/JuliaStats/GLM.jl/pull/487#discussion_r985297465 What is the question you are referring to?

  3. https://github.com/JuliaStats/GLM.jl/pull/487#discussion_r985297566

and others, I think.

As far as the general PR goes: I am very puzzled by loglikelihood, deviance, and nulldeviance. I tried R and Stata, and with weights they all give different, conflicting results. I have to think about this. I cannot look at the code, since R is GPL (so I should not consult it while writing this implementation) and Stata is proprietary.

gragusa avatar Oct 03 '22 11:10 gragusa

Ah sorry the GitHub UI doesn't show the threads in which I added the comments. These are: https://github.com/JuliaStats/GLM.jl/pull/487/files/dbc9ae999d0dd43af023e5b182014fdef7afa41f#r958885835 https://github.com/JuliaStats/GLM.jl/pull/487/files/dbc9ae999d0dd43af023e5b182014fdef7afa41f#r978083032 https://github.com/JuliaStats/GLM.jl/pull/487/files/dbc9ae999d0dd43af023e5b182014fdef7afa41f#r978061661 https://github.com/JuliaStats/GLM.jl/pull/487/files/dbc9ae999d0dd43af023e5b182014fdef7afa41f#r978063134

  1. Why did dof_residual change type? Because UnitWeights has eltype Int (previously, no weighting meant an empty float array).

Ah, I see, I think I originally defined nobs to always return a float to make the function type-stable, as the presence/absence of weights was not reflected in the model type. Now this is OK thanks to weight vector types.

As far as the general PR goes: I am very puzzled by loglikelihood, deviance, and nulldeviance. I tried R and Stata, and with weights they all give different, conflicting results. I have to think about this. I cannot look at the code, since R is GPL (so I should not consult it while writing this implementation) and Stata is proprietary.

Can you point me to an example of a problematic model? I can have a look, including at the R code, since I'm not the one writing the implementation.

nalimilan avatar Oct 03 '22 12:10 nalimilan

https://github.com/JuliaStats/GLM.jl/pull/487/files/dbc9ae999d0dd43af023e5b182014fdef7afa41f#r978083032

Isn't this equivalent?

Suggested change

-    X = modelmatrix(obj; weighted=isweighted(obj))
-    if k == size(X, 2)
-        XtX = crossmodelmatrix(obj; weighted=isweighted(obj))
+    X = modelmatrix(obj; weighted=true)
+    if k == size(X, 2)
+        XtX = crossmodelmatrix(obj; weighted=true)

modelmatrix is implemented as

function modelmatrix(obj::LinPredModel; weighted::Bool=false)
    if isweighted(obj)
        mul!(obj.pp.scratchm1, Diagonal(sqrt.(obj.pp.wts)), obj.pp.X)
    else
        obj.pp.X
    end
end

so the keyword argument does not do much.... should I define it as

function modelmatrix(obj::LinPredModel; weighted::Bool=isweighted(obj))
    if weighted
        mul!(obj.pp.scratchm1, Diagonal(sqrt.(obj.pp.wts)), obj.pp.X)
    else
        obj.pp.X
    end
end

After this change they will be equivalent. However, for the unweighted model, it is better to specify weighted=false to avoid an unnecessary matrix multiplication.

https://github.com/JuliaStats/GLM.jl/pull/487/files/dbc9ae999d0dd43af023e5b182014fdef7afa41f#r978083032

For the standard case (homoskedastic errors), invchol returns a (p x p) matrix with NaN for the coefficients corresponding to the collinear rows. The formula is $\hat{\sigma}^2 (X'X)^{-1}$, so we get NaN for the variance/stderrors of these coefficients. For the weighted case, the formula for the variance is $(X'X)^{-1} \sum_{i=1}^n X_i'u_iu_i'X_i (X'X)^{-1}$; without the machinery, NaN would propagate through the matrix products, resulting in an all-NaN matrix. The machinery removes the NaN elements, computes the matrix products, and then expands the matrix back to its original size. I hope it makes sense.
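A hedged sketch of that machinery (pad_with_nan and the inputs are made up for illustration; this is not the PR's code):

```julia
# Compute products on the linearly independent coefficients only,
# then pad the result back to full size with NaN.
function pad_with_nan(V::AbstractMatrix, keep::AbstractVector{Int}, p::Int)
    out = fill(NaN, p, p)
    out[keep, keep] = V
    return out
end

bread = [2.0 0.5 NaN; 0.5 3.0 NaN; NaN NaN NaN]  # NaN row/col ↔ dropped coefficient
keep  = [i for i in 1:3 if !isnan(bread[i, i])]  # [1, 2]
V = pad_with_nan(bread[keep, keep] * 0.5, keep, 3)
# V has finite entries for the kept coefficients and NaN elsewhere,
# instead of NaN propagating through the whole product.
```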

https://github.com/JuliaStats/GLM.jl/pull/487/files/dbc9ae999d0dd43af023e5b182014fdef7afa41f#r978061661 ...about performance for the unweighted case.

Committed a change

https://github.com/JuliaStats/GLM.jl/pull/487/files/dbc9ae999d0dd43af023e5b182014fdef7afa41f#r958885835

Will take care of this.....needs testing

gragusa avatar Oct 03 '22 15:10 gragusa

Likelihood issue

The likelihood (or quasi-likelihood) used for fitting in all three cases is the same - that's why we get the same point estimates.

The problem is that we can sometimes define post-estimation likelihoods that (somewhat ad hoc) try to take into account the nature of the weights. Being ad hoc, they are arbitrary. See this HTML document I made. You can see that the Stata numbers match neither those of R nor those of this PR.

Personally, I would define the likelihood in the same way for all weights -- more transparent. If we want to define different likelihoods, then we have to carefully document what they are (at the moment it is not clear what they are in Stata; I know what they are in R).

gragusa avatar Oct 03 '22 16:10 gragusa

After this change they will be equivalent....however, for the unweighted model, is better to specify weighted=FALSE to avoid a matrix multiplication.

I'd use if weighted && isweighted(x) to avoid the unnecessary multiplication.
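For illustration, the guard could look like this (demo_modelmatrix is a standalone stand-in taking plain arrays, whereas the real method takes a model object):

```julia
using LinearAlgebra

# Only pay for the √wᵢ row-rescaling when the model is actually
# weighted *and* the caller asked for the weighted matrix.
function demo_modelmatrix(X, wts; weighted::Bool=true)
    isw = !isempty(wts)   # stand-in for isweighted(obj)
    if weighted && isw
        Diagonal(sqrt.(wts)) * X   # rows scaled by √wᵢ
    else
        X                          # no-op: avoids the multiplication
    end
end

demo_modelmatrix([1.0 2.0; 3.0 4.0], Float64[])    # unchanged
demo_modelmatrix([1.0 2.0; 3.0 4.0], [4.0, 9.0])   # rows scaled by 2 and 3
```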

For the standard case (homoskedastic errors), invchol returns a (p x p) matrix with NaN for coefficients corresponding to the collinear rows. The formula is...

OK, thanks for the details.

The likelihood (or quasi-likelihood) used for fitting in all three cases is the same - that's why we get the same point estimates.

The problem is that we can sometimes define post-estimation likelihood that (a little bit ad hoc) try to take into account the nature of the weights. Being ad hoc, they are arbitrary. See this document I made here html. You can see how Stata numbers are not the same with those of R and those of this PR.

Personally, I would define the likelihood in the same way for all weights - more transparent. If we want to define different likelihoods then we have to carefully document what they are (at the moment is not clear what they are in Stata - I know what they are in R).

AFAICT we can't use the same likelihood definition for all weights types, as they need to be scale-free for probability weights, right? One criterion I can see to choose how to define the likelihood is that the McFadden pseudo-R² for normal family with identity link should be equal to the R² of the equivalent linear model. Does that make sense?

nalimilan avatar Oct 03 '22 16:10 nalimilan

AFAICT we can't use the same likelihood definition for all weights types, as they need to be scale-free for probability weights, right? One criterion I can see in choosing how to define the likelihood is that the McFadden pseudo-R² for a normal family with an identity link should be equal to the R² of the equivalent linear model. Does that make sense?

They all use the same likelihood up to a normalizing constant (which as you correctly pointed out depends on the scale of the weights). Also, the deviance is the same regardless of the weight type, which is what we have (consistent with R). The deviance is twice the difference between the loglik and the null loglik. Thus, the differences between these likelihoods are due to constant scaling, which would not matter when using the likelihood for comparison.

For instance, R survey package seems to calculate the likelihood as (minus) half the ~likelihood~ deviance (with a deviance that is equal to GLM.jl and R with analytic weights).

gragusa avatar Oct 03 '22 18:10 gragusa

For instance, R survey package seems to calculate the likelihood as (minus) half the likelihood (with a deviance that is equal to GLM.jl and R with analytic weights).

You mean "half the deviance", right? Indeed I just checked that it's the case with svyglm.

But yeah now I get that the constant factor goes away when computing the ratio of likelihoods so requiring the McFadden pseudo-R² to equal the R² isn't enough to choose a definition.

However, the definition used by svyglm doesn't seem to work well with other pseudo-R² measures: for example, the Cox-Snell pseudo-R² 1 - exp(2 * (logLik(nullmodel) - logLik(model)) / nobs(model)) always seems to return 1 in my tests. This means that fitting a model with probability weights all equal to one doesn't give the same Cox-Snell pseudo-R² as fitting an unweighted model. That's problematic, right?
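For concreteness, the Cox-Snell computation quoted above can be written as a tiny standalone function (the log-likelihood values are hypothetical, not from a fitted model):

```julia
# Cox-Snell pseudo-R²: 1 - exp(2 * (ll_null - ll_model) / n)
coxsnell(ll_null, ll_model, n) = 1 - exp(2 * (ll_null - ll_model) / n)

coxsnell(-120.0, -100.0, 50)   # ≈ 0.551
coxsnell(-100.0, -100.0, 50)   # 0.0 when the model adds nothing over the null
```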

BTW, the R survey package is intended mainly for complex designs (strata, etc.), which means that most standard statistics don't work anyway. This may have led the package to diverge from standard functions (e.g. logLik prints a warning). Since our scope is more limited here (no complex sampling) we might better use different definitions. Not sure.

nalimilan avatar Oct 03 '22 20:10 nalimilan

However, the definition used by svyglm doesn't seem to work well with other pseudo-R², as for example the Cox-Snell pseudo-R² 1 - exp(2 * (logLik(nullmodel) - logLik(model)) / nobs(model)) always seem to return 1 in my tests. This means that fitting a model with probability weights all equal to one doesn't give the same Cox-Snell pseudo-R² as fitting an unweighted model. That's problematic, right?

If the deviance is given by $$D = -2*(logLik(model)-logLik(nullmodel))$$

Then, $$\text{pseudo-}R^{2} = 1 - \exp(D / \text{nobs}(model)),$$ which is again equal regardless of the weighting type. I believe the differences between software are due to different implementations and choices of the normalization constant. For instance, Stata for aweights reports the deviance using rescaled weights $\bar{w}_i = w_i / \frac{\sum_{i=1}^n w_i}{n}$.

gragusa avatar Oct 04 '22 08:10 gragusa

Note that the relation $$D = -2*(logLik(model)-logLik(nullmodel))$$ does not actually hold for even unweighted models (see https://github.com/JuliaStats/GLM.jl/pull/115#discussion_r45355268). And the pseudo-R² is defined in terms of log-likelihood, not in terms of the deviance.

In R, with svyglm, the deviance is the same when fitting an unweighted model and when fitting a model with all weights equal. But the log-likelihood is completely different, since R's glm (and GLM.jl) doesn't define the log-likelihood as $-D/2$. So I don't think we can adopt the definition from svyglm.

Don't you agree that the log-likelihood should be the same when fitting a model without weights and when fitting a model with probability weights being all equal? That seems to imply we should choose the right normalization of weights to ensure that property. A simple solution would be to normalize them to sum to the number of observations. Then we could use the same formula to compute the likelihood as for frequency weights.
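A sketch of that normalization (normalize_wts is an illustrative helper, not an API proposal):

```julia
# Rescale probability weights to sum to the number of observations,
# so all-equal weights reduce to unit weights.
normalize_wts(w) = w .* (length(w) / sum(w))

normalize_wts([2.0, 2.0, 2.0, 2.0])   # [1.0, 1.0, 1.0, 1.0]
normalize_wts([1.0, 3.0])             # [0.5, 1.5]
```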

For analytic weights the situation is more tricky as the definition isn't completely clear in Stata nor in StatsBase (see https://github.com/JuliaStats/StatsBase.jl/issues/758). But it seems that at least when all weights are equal to one we should get the same results as an unweighted model. Cc: @ParadaCarleton

nalimilan avatar Oct 04 '22 12:10 nalimilan

For analytic weights the situation is more tricky as the definition isn't completely clear in Stata nor in StatsBase (see JuliaStats/StatsBase.jl#758). But it seems that at least when all weights are equal to one we should get the same results as an unweighted model. Cc: @ParadaCarleton

I outlined a response there -- sorry, I forgot about it. Similar comments apply to GLM.jl: I don't think it's necessarily possible to create a single method for glm that deals with analytic weights, unless lots and lots of keywords are added to supply additional information. The problem with analytic weights is that there are a lot of assumptions that go into a good meta-analysis: questions like whether the means and variances can be considered homogeneous, whether there's possibly publication bias, and which estimators are best (as the choice is highly nontrivial when the data has been pre-aggregated).

At some point, someone will probably have to create a Julia package for meta-analysis, but that will require lots of hard work and research by a professional statistician--probably more than we can fit into a single pull request for GLM.jl.

ParadaCarleton avatar Oct 06 '22 03:10 ParadaCarleton

Thanks for commenting. Though in other software (R, Stata...) it's standard to use weighted least squares estimation with analytic weights. For example, in R, ?lm says:

Non-‘NULL’ ‘weights’ can be used to indicate that different
observations have different variances (with the values in
‘weights’ being inversely proportional to the variances); or
equivalently, when the elements of ‘weights’ are positive integers
w_i, that each response y_i is the mean of w_i unit-weight
observations (including the case that there are w_i observations
equal to y_i and the data have been summarized). However, in the
latter case, notice that within-group variation is not used.
Therefore, the sigma estimate and residual degrees of freedom may
be suboptimal; in the case of replication weights, even wrong.
Hence, standard errors and analysis of variance tables should be
treated with care.

Do you think we should provide a way to get the same estimates as R, or is that definition too broken to be useful? If so, with which weights type?

If that's too tricky we could support only UnitWeights, FrequencyWeights and ProbablityWeights for now as the PR is already large.

nalimilan avatar Oct 07 '22 13:10 nalimilan

The issue in StatsBase.jl and this PR are pretty much unrelated. No matter what weights are used, the estimated coefficients (the point estimates) will be the same. They all maximize $$\sum_{i=1}^n w_i \ell_i(\theta),$$ where $\ell_i(\theta)$ is the log-likelihood contribution of observation $i$ under the model one would maximize if weights were not provided.
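A quick numerical check of this invariance for weighted least squares (illustrative data; wls is a made-up helper, not GLM.jl code):

```julia
using LinearAlgebra

# Rescaling all weights by a constant leaves the maximizer of
# Σ wᵢ ℓᵢ(θ) unchanged: here, the WLS normal equations scale on
# both sides, so the solution is identical.
X = [ones(4) [1.0, 2.0, 3.0, 4.0]]
y = [1.0, 2.2, 2.9, 4.3]
w = [1.0, 2.0, 3.0, 4.0]
wls(w) = (X' * Diagonal(w) * X) \ (X' * Diagonal(w) * y)

wls(w) ≈ wls(10 .* w)   # true: the scale of the weights does not matter
```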

With ProbabilityWeights the variance is different. There is also the issue of what likelihood to report. Again, under ProbabilityWeights the log-likelihood being maximized is a quasi-likelihood, and different software reports different log-likelihood.

~AnalyticWeights~ ImportanceWeights are the most commonly used because they have less structure and are helpful in cases where the weights are not related to sampling. For instance, in certain applications in labor economics, one weights the observations by workers' hours worked. Or, in propensity-score-based causal inference, the weights are the probability of being treated.

So we can keep splitting hairs about the weights and their definitions, but I think it would only add complexity. R admits only ~AnalyticWeights~ ImportanceWeights for lm and glm. Other, more specialized packages then use glm and lm with ~AnalyticWeights~ ImportanceWeights: the survey package uses glm underneath for svyglm, with ~analytic~ importance weights.

gragusa avatar Oct 08 '22 13:10 gragusa