GLM.jl icon indicating copy to clipboard operation
GLM.jl copied to clipboard

ftest for GeneralizedLinearModel

Open jbrea opened this issue 6 years ago • 12 comments

Would it be possible to extend ftest such that it can handle GeneralizedLinearModels?

julia> using GLM, DataFrames

julia> df1 = DataFrame(condition = CategoricalArray(["a", "a", "b", "b"]),
                       data = [9, 10, 7, 8]);

julia> model1 = fit(LinearModel, @formula(data ~ 1 + condition), df1);

julia> nullmodel1 = fit(LinearModel, @formula(data ~ 1), df1);

julia> ftest(model1.model, nullmodel1.model)
        Res. DOF DOF ΔDOF    SSR    ΔSSR     R²    ΔR²     F*  p(>F)
Model 1        2   3      1.0000         0.8000
Model 2        3   2   -1 5.0000 -4.0000 0.0000 0.8000 8.0000 0.1056

julia> model2 = fit(GeneralizedLinearModel, @formula(data ~ 1 + condition), df1,
                    Normal(), IdentityLink());

julia> nullmodel2 = fit(GeneralizedLinearModel, @formula(data ~ 1), df1,
                    Normal(), IdentityLink());

julia> ftest(model2.model, nullmodel2.model)
ERROR: MethodError: no method matching ftest(::GeneralizedLinearModel{GlmResp{Array{Float64,1},Normal{Float64},IdentityLink},DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}}, ::GeneralizedLinearModel{GlmResp{Array{Float64,1},Normal{Float64},IdentityLink},DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}})                             
Stacktrace:
 [1] top-level scope at none:0
 [2] run_backend(::REPL.REPLBackend) at /home/j/.julia/packages/Revise/TmjcT/src/Revise.jl:766
 [3] (::getfield(Revise, Symbol("##58#60")){REPL.REPLBackend})() at ./task.jl:259

jbrea avatar Dec 10 '18 17:12 jbrea

The F test isn't defined for GLMs. But we could implement the likelihood ratio test (lrtest?).

nalimilan avatar Dec 12 '18 15:12 nalimilan

In the example above I think model1 = model2, mathematically. Yes, for general GLMs it would be nice to have the likelihood ratio test.

jbrea avatar Dec 12 '18 15:12 jbrea

I guess we could define ftest for GeneralizedLinearModel{<:Any,Normal{Float64},IdentityLink}. But how useful would it be given that you can do the same thing with LinearModel?

nalimilan avatar Dec 12 '18 15:12 nalimilan

Agreed, the usefulness is pretty limited :smile:. When I started exploring the package, I was just surprised that ftest didn't work for GeneralizedLinearModel{<:Any,Normal{Float64},IdentityLink}.

jbrea avatar Dec 12 '18 16:12 jbrea

The example given by the OP is actually a case of LinearModels being fit as GeneralizedLinearModels so, in theory, there is an F-test. However, it is simplest to just fit them as LinearModels

GeneralizedLinearModels do support an analysis of deviance, which is similar in spirit to analysis of variance but, I believe, is equivalent to a likelihood ratio test. Sometimes it is easier to evaluate the deviance as the sum of the deviance residuals than to evaluate the log-likelihood but if you are only doing it once per model that shouldn't matter.

dmbates avatar Dec 12 '18 16:12 dmbates

What would the formula be to obtain the nulldeviance for GeneralizedLinearModels without having to fit the intercept only model?

Nosferican avatar Dec 25 '18 17:12 Nosferican

You just need to compute the response under the null model, and then call the existing code to compute the deviance. The response under the null model is probably just the mean of the observed responses, though there might be some subtleties I'm missing.

nalimilan avatar Dec 25 '18 18:12 nalimilan

deviance(model) = -2loglikelihood(model) (in the loose sense) For Poisson, the mean response need not be an integer and can potentially be outside the support of the distribution leading to a -Inf log-likelihood. From the 2(y * log(y / μ) - y + μ) formula from Wikipedia, also having any 0 response in the case of the Poisson model will lead to -Inf.

Nosferican avatar Dec 25 '18 18:12 Nosferican

Not at all, Poisson can perfectly be generalized to non-integers so that's not a problem. Also x log(x) has a zero limit when x tends to zero.

nalimilan avatar Dec 25 '18 20:12 nalimilan

The value for the Poisson would be μ(X * (ones(y) \ y)) which I don't think one can get without solving the intercept only model. Furthermore, for the NegativeBinomial one needs to get the mle θ for the intercept only model in addition to the parameter estimate. For now I usually just solve the intercept only model as well and apply the log-likelihood/deviance of it as the null log-likelihood/deviance. The exception is with the linear model which I just use the mean response.

Nosferican avatar Dec 25 '18 22:12 Nosferican

I haven't really looked at the math, but I've done tests with a few values, and the fitted values for a Poisson model with only the intercept always seem to be equal to the mean response. (Negative binomial isn't a standard GLM so that's yet another issue.)

nalimilan avatar Dec 26 '18 14:12 nalimilan

I have lost track of the purpose here but there are a couple of observations that might help.

  • I'm not sure why fitting a null model needs a short-circuit form. Why not use @formula(Y ~ 1)? It turns out that for the Poisson the estimate of the "intercept" term in that model is the sample average but you don't need to know that.

  • Most of the time it is simpler and faster to evaluate the deviance as the sum of the (squared) deviance residuals rather than from the likelihood formula. The deviance residuals have the difference with the saturated model built-in and that usually has the effect of cancelling out some of the terms in the probability density or mass function that are awkward to evaluate. This was the reason for defining the deviance as the difference between negative twice the log-likelihood for the current model and the saturated model instead of just saying negative twice the log-likelihood.

  • I keep writing "(squared) deviance residuals" because, technically, the deviance residuals are the signed square roots of the quantities returned by the devresid method. I used the name devresid to ensure bug-for-bug compatibility with the R family specification. In retrospect that may not have been the best idea.

  • The functions xlogx and xlogy are defined to return the correct limit of some products like x * log(x) as x goes to zero.

dmbates avatar Dec 26 '18 20:12 dmbates