GLM.jl icon indicating copy to clipboard operation
GLM.jl copied to clipboard

add residuals method for GLM

Open palday opened this issue 3 years ago • 25 comments

~I haven't had the time to figure out how the denominator is computed for Pearson residuals and so haven't yet implemented them.~

I set the default residual type to deviance to match R's behavior, though that might be surprising for GLMs with normal distribution and identity link.

Note: once again I've noticed that -0 in outputs creates a problem. Tests fail for me on Julia 1.8 with Apple Silicon because of sign-swapping on zero.

palday avatar Sep 20 '22 22:09 palday

Codecov Report

Patch coverage is 90.90% of modified lines.

Files Changed Coverage
src/glmfit.jl 90.90%

:loudspeaker: Thoughts on this report? Let us know!.

codecov-commenter avatar Sep 21 '22 17:09 codecov-commenter

@nalimilan the current doctests failures are unrelated to this PR, but do you have any objections to me updating them? For now, I would update update the expected output, but longterm we should improve pretty-printing of the types.

palday avatar Sep 21 '22 21:09 palday

Sure.

Would you feel like adding tests for all other model families? I'm always uncomfortable when what we test varies from one family to the next.

nalimilan avatar Sep 24 '22 17:09 nalimilan

Would you feel like adding tests for all other model families? I'm always uncomfortable when what we test varies from one family to the next.

Of course -- I've already got tests in there for deviance, response and working residuals of the families:

  • Normal with identity link
  • Binomial + Bernoulli
  • InverseGaussian
  • Gamma

I've also added commented out tests for the same families with Pearson residuals.

palday avatar Sep 29 '22 03:09 palday

Bump. Perhaps @mousum-github can also help review this PR?

ViralBShah avatar Nov 06 '22 01:11 ViralBShah

Thanks, @ViralBShah, let me go through the PR.

mousum-github avatar Nov 07 '22 03:11 mousum-github

We can easily add Pearson Residuals using the following: - sign.(response(model) .- fitted(model)) .* sqrt.(model.rr.wrkwt .* abs2.(model.rr.wrkresid))

(and the sum of square of Pearson residuals / residual dof, is the estimated dispersion parameter)

mousum-github avatar Nov 11 '22 06:11 mousum-github

@mousum-github Maybe you know the answer to https://github.com/JuliaStats/GLM.jl/pull/499/files#r976581242?

@palday Do you plan to add test for other method families?

nalimilan avatar Nov 13 '22 14:11 nalimilan

@mousum-github Maybe you know the answer to https://github.com/JuliaStats/GLM.jl/pull/499/files#r976581242?

@palday Do you plan to add test for other method families?

I am not sure about the relationship. I tried but was unable to establish such a relationship easily.

mousum-github avatar Nov 21 '22 19:11 mousum-github

I haven't had the time to figure out how the denominator is computed for Pearson residuals and so haven't yet implemented them.

It's based on the variance function:

$$ e_i = \frac{y_i - \hat{\mu}_i}{\sqrt{V(\hat{\mu}_i)}} $$

so should be sqrt(glmvar(model.rr.d, _))

ararslan avatar Dec 07 '22 02:12 ararslan

@nalimilan what families am I missing?

  • [x] Normal with identity link
  • [x] Binomial + Bernoulli
  • [x] InverseGaussian
  • [x] Gamma
  • [x] Poisson
  • [x] Geometric
  • [x] Negative Binomial

palday avatar Dec 09 '22 00:12 palday

@nalimilan what families am I missing?

How about testing residuals for all existing tests? We currently test all results (fit stats, coefs...) for all of them so that would be consistent. That would also ensure we don't miss a corner case (weights, no intercept, etc.).

nalimilan avatar Dec 17 '22 18:12 nalimilan

How about testing residuals for all existing tests?

FWIW I think this is pretty well tested as it stands and would favor having it merged without that. Perhaps once this is merged we can open an issue to add residuals tests for all models and someone can add tackle that in another PR.

ararslan avatar Jan 17 '23 21:01 ararslan

Models with weights are not tested though, we need to make sure the result is correct or an error is thrown.

nalimilan avatar Mar 28 '23 19:03 nalimilan

@mousum-github please don't start making changes on my PR without asking me first.

palday avatar May 04 '23 12:05 palday

I would like to add a few more test cases this week.

mousum-github avatar May 04 '23 12:05 mousum-github

@mousum-github please don't start making changes on my PR without asking me first.

Sure, @palday. It is my bad. Please let me know if I can do anything. I am preparing a few more test cases in R with weights and offsets.

mousum-github avatar May 04 '23 14:05 mousum-github

The following compares three types of residuals (response, deviance and Pearson) for different distributions and link functions in Julia and R. Overall the comparisons look good.

julia> using GLM, Random, LinearAlgebra, DataFrames, RCall

julia> Random.seed!(12345);

julia> df = DataFrame(x0 = ones(10), x1 = rand(10), x2 = rand(10), n = rand(50:70, 10), w=1 .+ 0.01*rand(10), off=1 .+ 0.01*rand(10));

julia> β = [1, -1, 1];

julia> X = convert(Matrix, df[:,1:3]);

julia> η = X*β;

julia> σ = 1;

julia>

julia> #  Poisson Distribution with Log link #

julia> Random.seed!(123456);

julia> μ = GLM.linkinv.(LogLink(), η);

julia> df["y"] = rand.(Poisson.(μ));

julia> mdl = glm(@formula(y ~ x1 + x2), df, Poisson(););

julia> jresid1,jresid2,jresid3 = GLM.residuals(mdl, type=:response), GLM.residuals(mdl, type=:deviance),  GLM.residuals(mdl, type=:pearson);

julia> @rput df;

julia> R"""
       mdl = glm(y ~ x1+x2, data=df, family=poisson());
       rresid1 = resid(mdl, type='response');
       rresid2 = resid(mdl, type='deviance');
       rresid3 = resid(mdl, type='pearson');
       """;

julia> @rget rresid1 rresid2 rresid3;

julia> isapprox(rresid1, jresid1, atol=1.0E-6) && isapprox(rresid2, jresid2, atol=1.0E-6) && isapprox(rresid3, jresid3, atol=1.0E-6)
true

julia> #  ..with weights and offset #

julia> mdl = glm(@formula(y ~ x1 + x2), df, Poisson(); wts=df["w"], offset=df["off"]);

julia> jresid1,jresid2,jresid3 = GLM.residuals(mdl, type=:response), GLM.residuals(mdl, type=:deviance),  GLM.residuals(mdl, type=:pearson);

julia> R"""
       mdl = glm(y ~ x1+x2, data=df, family=poisson(), weights=w, offset=off);
       rresid1 = resid(mdl, type='response');
       rresid2 = resid(mdl, type='deviance');
       rresid3 = resid(mdl, type='pearson');
       """;

julia> @rget rresid1 rresid2 rresid3;

julia> isapprox(rresid1, jresid1, atol=1.0E-6) && isapprox(rresid2, jresid2, atol=1.0E-6) && isapprox(rresid3, jresid3, atol=1.0E-6)
true

julia>

julia> #  Binomial Distribution with Logit link #

julia> Random.seed!(123456);

julia> μ = GLM.linkinv.(LogitLink(), η);

julia> n = df["n"];

julia> df["y"] = rand.(Binomial.(n, μ)) ./ n;

julia> mdl = glm(@formula(y ~ x1 + x2), df, Binomial(););

julia> @rput df;

julia> jresid1,jresid2,jresid3 = GLM.residuals(mdl, type=:response), GLM.residuals(mdl, type=:deviance),  GLM.residuals(mdl, type=:pearson);

julia> R"""
       mdl = glm(y ~ x1+x2, data=df, family=binomial());
       rresid1 = resid(mdl, type='response');
       rresid2 = resid(mdl, type='deviance');
       rresid3 = resid(mdl, type='pearson');
       """;
┌ Warning: RCall.jl: Warning in eval(family$initialize) :
│   non-integer #successes in a binomial glm!
└ @ RCall C:\Users\user\.julia\packages\RCall\6kphM\src\io.jl:172

julia> @rget rresid1 rresid2 rresid3;

julia> rresid1 ≈ jresid1 && rresid2 ≈ jresid2 && rresid3 ≈ jresid3
true

julia> # .. with weights and offset #

julia> mdl = glm(@formula(y ~ x1 + x2), df, Binomial(); wts=df["w"], offset=df["off"]);

julia> jresid1,jresid2,jresid3 = GLM.residuals(mdl, type=:response), GLM.residuals(mdl, type=:deviance),  GLM.residuals(mdl, type=:pearson);

julia> R"""
       mdl = glm(y ~ x1+x2, data=df, family=binomial(), weights=w, offset=off);
       rresid1 = resid(mdl, type='response');
       rresid2 = resid(mdl, type='deviance');
       rresid3 = resid(mdl, type='pearson');
       """;
┌ Warning: RCall.jl: Warning in eval(family$initialize) :
│   non-integer #successes in a binomial glm!
│ Warning in eval(family$initialize) :
│   non-integer #successes in a binomial glm!
└ @ RCall C:\Users\user\.julia\packages\RCall\6kphM\src\io.jl:172

julia> @rget rresid1 rresid2 rresid3;

julia> rresid1 ≈ jresid1 && rresid2 ≈ jresid2 && rresid3 ≈ jresid3
true

julia>

julia> #  Bernoulli Distribution with Logit link #

julia> Random.seed!(123456);

julia> μ = GLM.linkinv.(LogitLink(), η);

julia> df["y"] = rand.(Bernoulli.(μ));

julia> mdl = glm(@formula(y ~ x1 + x2), df, Bernoulli(););

julia> @rput df;

julia> jresid1,jresid2,jresid3 = GLM.residuals(mdl, type=:response), GLM.residuals(mdl, type=:deviance),  GLM.residuals(mdl, type=:pearson);

julia> R"""
       mdl = glm(y ~ x1+x2, data=df, family=binomial());
       rresid1 = resid(mdl, type='response');
       rresid2 = resid(mdl, type='deviance');
       rresid3 = resid(mdl, type='pearson');
       """;

julia> @rget rresid1 rresid2 rresid3;

julia> rresid1 ≈ jresid1 && rresid2 ≈ jresid2 && rresid3 ≈ jresid3
true

julia> # .. with weights and offset #

julia> mdl = glm(@formula(y ~ x1 + x2), df, Bernoulli(); wts=df["w"], offset=df["off"]);

julia> jresid1,jresid2,jresid3 = GLM.residuals(mdl, type=:response), GLM.residuals(mdl, type=:deviance),  GLM.residuals(mdl, type=:pearson);

julia> R"""
       mdl = glm(y ~ x1+x2, data=df, family=binomial(), weights=w, offset=off);
       rresid1 = resid(mdl, type='response');
       rresid2 = resid(mdl, type='deviance');
       rresid3 = resid(mdl, type='pearson');
       """;
┌ Warning: RCall.jl: Warning in eval(family$initialize) :
│   non-integer #successes in a binomial glm!
│ Warning in eval(family$initialize) :
│   non-integer #successes in a binomial glm!
└ @ RCall C:\Users\user\.julia\packages\RCall\6kphM\src\io.jl:172

julia> @rget rresid1 rresid2 rresid3;

julia> rresid1 ≈ jresid1 && rresid2 ≈ jresid2 && rresid3 ≈ jresid3
true

julia>

julia>

julia> #  Negative Binomial Distribution with Logit link #

julia> Random.seed!(123456);

julia> μ = GLM.linkinv.(LogLink(), η);

julia> π = μ ./ (μ .+ 10.0);

julia> df["y"] = rand.(NegativeBinomial.(10, π));

julia> mdl = negbin(@formula(y ~ x1 + x2), df, LogLink(); rtol=1.0E-12);

julia> jresid1,jresid2,jresid3 = GLM.residuals(mdl, type=:response), GLM.residuals(mdl, type=:deviance),  GLM.residuals(mdl, type=:pearson);

julia> @rput df;

julia> R"""
       library("MASS");
       mdl = glm.nb(y ~ x1+x2, data=df);
       rresid1 = resid(mdl, type='response');
       rresid2 = resid(mdl, type='deviance');
       rresid3 = resid(mdl, type='pearson');
       """;

julia> @rget rresid1 rresid2 rresid3;

julia> isapprox(rresid1, jresid1, atol=1.0E-4) && isapprox(rresid2, jresid2, atol=1.0E-4) && isapprox(rresid3, jresid3, atol=1.0E-4)
false

julia>

julia> #  Gamma Distribution with InverseLink link #

julia> Random.seed!(1234);

julia> μ = GLM.linkinv.(LogLink(), η);

julia> df["y"] = rand.(Gamma.(μ, σ));

julia> mdl = glm(@formula(y ~ x1 + x2), df, Gamma(););

julia> jresid1,jresid2,jresid3 = GLM.residuals(mdl, type=:response), GLM.residuals(mdl, type=:deviance),  GLM.residuals(mdl, type=:pearson);

julia> @rput df;

julia> R"""
       mdl = glm(y ~ x1+x2, data=df, family=Gamma());
       rresid1 = resid(mdl, type='response');
       rresid2 = resid(mdl, type='deviance');
       rresid3 = resid(mdl, type='pearson');
       """;

julia> @rget rresid1 rresid2 rresid3;

julia> isapprox(rresid1, jresid1, atol=1.0E-6) && isapprox(rresid2, jresid2, atol=1.0E-6) && isapprox(rresid3, jresid3, atol=1.0E-6)
true

julia> # .. with weights and offset #

julia> mdl = glm(@formula(y ~ x1 + x2), df, Gamma(),; wts=df["w"], offset=df["off"]);

julia> jresid1,jresid2,jresid3 = GLM.residuals(mdl, type=:response), GLM.residuals(mdl, type=:deviance),  GLM.residuals(mdl, type=:pearson);

julia> R"""
       mdl = glm(y ~ x1+x2, data=df, family=Gamma(), weights=w, offset=off);
       cf = coef(mdl);
       rresid1 = resid(mdl, type='response');
       rresid2 = resid(mdl, type='deviance');
       rresid3 = resid(mdl, type='pearson');
       """;

julia> @rget rresid1 rresid2 rresid3;

julia> isapprox(rresid1, jresid1, atol=1.0E-6) && isapprox(rresid2, jresid2, atol=1.0E-6) && isapprox(rresid3, jresid3, atol=1.0E-6)
true

julia>

julia> #  InverseGaussian Distribution with InverseSquareLink link #

julia> Random.seed!(1234);

julia> μ = GLM.linkinv.(InverseSquareLink(), η);

julia> df["y"] = rand.(InverseGaussian.(μ, σ));

julia> mdl = glm(@formula(y ~ x1 + x2), df, InverseGaussian(););

julia> @rput df;

julia> jresid1,jresid2,jresid3 = GLM.residuals(mdl, type=:response), GLM.residuals(mdl, type=:deviance),  GLM.residuals(mdl, type=:pearson);

julia> R"""
       mdl = glm(y ~ x1+x2, data=df, family=inverse.gaussian());
       cff = coef(mdl);
       rresid1 = resid(mdl, type='response');
       rresid2 = resid(mdl, type='deviance');
       rresid3 = resid(mdl, type='pearson');
       """;

julia> @rget rresid1 rresid2 rresid3;

julia> isapprox(rresid1, jresid1, atol=1.0E-6) && isapprox(rresid2, jresid2, atol=1.0E-6) && isapprox(rresid3, jresid3, atol=1.0E-6)
true

julia> # .. with weights and offset #

julia> mdl = glm(@formula(y ~ x1 + x2), df, InverseGaussian(); wts=df["w"], offset=df["off"]);

julia> jresid1,jresid2,jresid3 = GLM.residuals(mdl, type=:response), GLM.residuals(mdl, type=:deviance),  GLM.residuals(mdl, type=:pearson);

julia> R"""
       mdl = glm(y ~ x1+x2, data=df, family=inverse.gaussian(), weights=w, offset=off);
       rresid1 = resid(mdl, type='response');
       rresid2 = resid(mdl, type='deviance');
       rresid3 = resid(mdl, type='pearson');
       """;

julia> @rget rresid1 rresid2 rresid3;

julia> isapprox(rresid1, jresid1, atol=1.0E-6) && isapprox(rresid2, jresid2, atol=1.0E-6) && isapprox(rresid3, jresid3, atol=1.0E-6)
true

mousum-github avatar May 18 '23 13:05 mousum-github

@palday Are you OK with @mousum-github adding more tests?

@mousum-github Thanks for checking. However, existing tests already fit models for all families, so better reuse these models and just add checks that residuals return the expected value rather than add new models.

nalimilan avatar May 28 '23 13:05 nalimilan

@nalimilan - I also considered suggesting some tests within existing test cases. Since there are conflicts in the runtests.jl right now, just checked new functionalities only with R, especially with weights and offsets.

mousum-github avatar May 29 '23 17:05 mousum-github

For residuals functionalities,

Existing Test cases:

  • Poisson GLM
  • (Binomial, Bernoulli) with LogitLink
  • Gamma
  • InverseGaussian
  • Geometric is a special case of NegativeBinomial
  • linear model with weights

So far I have added test cases (locally) to the following

  • Linear model
  • Linear model with weights
  • Linear model with dropcollinearity
  • Linear model with rankdeficieny
  • Linear model without intercept
  • Bernoulli ProbitLink
  • Bernoulli CauchitLink
  • Bernoulli CloglogLink
  • Normal with offset
  • Normal LogLink and offset
  • Poisson LogLink with offset and weights
  • Gamma with LogLink
  • Gamma with IdentityLink

And, I am planning to add some more test cases to the following by this week

  • GLM with no intercept
  • Geometric LogLink
  • NegativeBinomial LogLink
  • NegativeBinomial LogLink Fixed θ

Also, I have changed the existing residuals function for lm. Now it will support all types like GLM.

My plan is to raise a PR to pa/resid branch whenever I am done.

@nalimilan, please let us know if there is a better way to proceed.

mousum-github avatar Jun 07 '23 07:06 mousum-github

The changes in the residuals function with lm are like the following,

    resid = response(model) - fitted(model)
    if length(model.rr.wts) > 0 && (type === :deviance || type === :pearson)
        return resid .* sqrt.(model.rr.wts)
    else
        return resid
    end

mousum-github avatar Jun 07 '23 08:06 mousum-github

Sorry, I've been a bit underwater with other priorities. I will start integrating these suggestions -- thanks @mousum-github !

palday avatar Jun 07 '23 16:06 palday

My plan is to raise a PR to pa/resid branch whenever I am done.

@nalimilan, please let us know if there is a better way to proceed.

Thanks. I guess that's the best approach so that @palday can have a look at the changes.

nalimilan avatar Jun 11 '23 12:06 nalimilan

Thanks @nalimilan. I have raised the PR #540. @palday - hope it will help.

mousum-github avatar Jun 12 '23 22:06 mousum-github