GLM.jl

Validation, CI, Residual variance, compare with lm() in R

Open PharmCat opened this issue 6 years ago • 7 comments

Hello! I am trying to write code for bioequivalence studies and validate it against reference datasets (https://www.ncbi.nlm.nih.gov/pubmed/25212768). I use this dataset: https://static-content.springer.com/esm/art%3A10.1208%2Fs12248-014-9661-0/MediaObjects/12248_2014_9661_MOESM1_ESM.txt. These data are used for assessing the robustness of statistical software implementations, and this validation should be done before submitting any result to a regulatory authority. In R I can do (where 'pkdata' is the data imported from the dataset):

lmobj <- lm(log(Var)~Trt+Per+Seq+Subj, data=pkdata)
lnCI <- confint(lmobj, c("TrtT"), level=0.9)
exp(lnCI)

I have results for 90% CI: TrtT 0.9076208 0.9961624

In Julia I use GLM:

using DataFrames, GLM
df = readtable("12248_2014_9661_MOESM1_ESM.txt", header = true, separator = '\t')
df.LnVar = log.(df.Var)
df.Subj = string.(df.Subj)
ols = lm(@formula(LnVar ~ Seq+Per+Trt+Subj), df)
cint = confint(ols, 0.9)
print("Lower:")
println(exp(cint[4,1]))
print("Upper:")
println(exp(cint[4,2]))

And I get different results: Lower: 0.9060577396300195, Upper: 0.9978809351491588

I also need the residual variance, and I don't know how to obtain it with GLM in Julia. In R I get it simply: summary(lmobj)$sigma^2
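
A sketch of how the residual variance can be computed with GLM.jl's accessor functions: for a linear model, `deviance` returns the residual sum of squares and `dof_residual` the residual degrees of freedom, so their ratio is the analogue of `summary(lmobj)$sigma^2` in R. The toy data below are made up for illustration:

```julia
using DataFrames, GLM

# Small synthetic dataset (hypothetical, standing in for the real one)
df = DataFrame(y = [1.0, 2.1, 2.9, 4.2, 5.1],
               x = [1.0, 2.0, 3.0, 4.0, 5.0])
ols = lm(@formula(y ~ x), df)

# Residual variance: residual sum of squares / residual degrees of freedom
sigma2 = deviance(ols) / dof_residual(ols)
```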

There are differences in the DF and in some Subj (8) coefficients. All other tools (SAS, SPSS, R) give me identical results, and I suppose lm() should behave the same everywhere.

Also you can look here: https://discourse.julialang.org/t/confidence-interval-for-certain-contrast-and-model-residual/20339/4

PharmCat avatar Feb 02 '19 12:02 PharmCat

R seems to have one more degree of freedom, possibly because it drops the case where Subj is 8. That doesn't explain the discrepancy, though. Did you mention on Discourse that you thought this might be related to the decomposition? GLM.jl uses Cholesky.

mkborregaard avatar Feb 02 '19 19:02 mkborregaard

R uses QR decomposition with pivoting. I suppose QR is generally used for linear models; it is recommended by https://www.stat.wisc.edu/courses/st849-bates/lectures/Orthogonal.pdf and used in other software (for example, see the SPSS algorithms: http://www.sussex.ac.uk/its/pdfs/SPSS_Statistics_Algorithms_22.pdf).

Generally the QR decomposition is preferred to the Cholesky decomposition for least squares problems because there is a certain loss of precision when forming X′X.
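
This loss of precision is easy to demonstrate with a nearly collinear design: cond(X′X) is roughly cond(X)², so solving the normal equations loses about twice as many significant digits as a QR-based solve. A minimal sketch on synthetic data:

```julia
using LinearAlgebra

# Nearly collinear design matrix (synthetic)
e = 1e-6
X = [1.0 1.0+e;
     1.0 1.0-e;
     1.0 1.0]
beta_true = [1.0, 2.0]
y = X * beta_true

beta_qr   = X \ y                                # QR-based least squares
beta_chol = cholesky(Symmetric(X'X)) \ (X' * y)  # normal equations via Cholesky

# beta_qr is typically far closer to beta_true here than beta_chol
```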

So, if GLM.jl uses Cholesky, maybe it is not a good choice for the default solution.

For this factor structure, the DF for the "model error" should be 16 = n - 2, where n is the number of subjects.

PharmCat avatar Feb 02 '19 20:02 PharmCat

There are two differences between the R model and the GLM.jl model:

  1. R automatically excludes a variable.
  2. R treats the Per variable as a factor.

After changing these two, Julia gives you the same confidence interval for Trt, i.e.

julia> categorical!(df, :Subj);

julia> categorical!(df, :Per);

julia> using RCall

julia> ols = lm(@formula(LnVar ~ Seq+Per+Trt+Subj), df, true);

julia> rols = R"lm(log(Var)~Trt+Per+Seq+Subj, data=$df)";

julia> rcopy(R"exp(confint($rols, c(\"TrtT\"), level=0.9))") - exp.(confint(ols, 0.9)[4:4,:])
1×2 Array{Float64,2}:
 4.44089e-16  6.66134e-16

Automatic exclusion of variables is a bit of a grey area. However, we should probably signal when the design matrix is numerically rank deficient.
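
A hypothetical pre-fit check along those lines (this is not GLM.jl API, just a sketch using the stdlib): compare the numerical rank of the design matrix with its column count and warn on a mismatch.

```julia
using LinearAlgebra

# Synthetic design matrix where col3 = col1 + col2, i.e. rank deficient
X = [1.0 2.0 3.0;
     1.0 0.0 1.0;
     2.0 1.0 3.0;
     1.0 1.0 2.0]

if rank(X) < size(X, 2)
    @warn "design matrix is numerically rank deficient; some coefficients are not identifiable"
end
```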

andreasnoack avatar Feb 03 '19 11:02 andreasnoack

Hello! I checked the code. OK, this line should be:

categorical!(df, :Per);

but it does not change the results. All the changes come from:

ols = lm(@formula(LnVar ~ Seq+Per+Trt+Subj), df, true);

allowrankdeficient::Bool=false

It's my fault - I didn't notice that there is an "allowrankdeficient" argument and that it defaults to false.

Unfortunately, the docs don't say what happens when this argument is set to true or false, and there is nothing about it in the "Julia and R comparison" either. I suppose this question will come up more often as more researchers arrive from R.

Maybe it is a good idea to add an explanation to the docs, plus a direct comparison of lm() with and without allowrankdeficient.

So, is the Cholesky decomposition used in both cases? Or only when allowrankdeficient = false? (When true, one subject is excluded, as in R.)

PharmCat avatar Feb 03 '19 14:02 PharmCat

I can confirm that this fixes the issue - I guess this can be closed, or changed to a documentation issue on allowrankdeficient?

@PharmCat the discussion of decomposition type is unrelated and is discussed elsewhere, e.g. https://github.com/JuliaStats/GLM.jl/pull/162 and https://github.com/JuliaStats/GLM.jl/issues/223. Douglas Bates, the author of the "Matrix decompositions for regression analysis" paper you linked above, is one of the maintainers of this package.

mkborregaard avatar Feb 03 '19 20:02 mkborregaard

I can confirm that this fixes the issue - I guess this can be closed, or changed to a documentation issue on allowrankdeficient?

Hi! Thank you very much! I think so; it would be great if the documentation clearly described this point, or included some additional comparison with R, because many R users are used to lm() working with rank-deficient design matrices by default.

PharmCat avatar Feb 03 '19 20:02 PharmCat

However, we should probably signal when the design matrix is numerically rank deficient.

So that's another thing to improve in addition to the documentation. (Or maybe we should make allowrankdeficient=true the default.)

nalimilan avatar Feb 03 '19 21:02 nalimilan