GLM.jl
Incorrect linear regression results
This is a bit urgent.
In a take-home exam that is currently ongoing, students need to estimate models from the attached dataset.
Here is the Julia code:
using GLM
using DataFrames
using CSV
data = CSV.read( "Car-Training.csv", DataFrame )
model = @formula( Price ~ Year + Mileage )
results = lm( model, data )
The output is the following:
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,LinearAlgebra.CholeskyPivoted{Float64,Array{Float64,2}}}},Array{Float64,2}}
Price ~ 1 + Year + Mileage
Coefficients:
─────────────────────────────────────────────────────────────────────────────────
Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
─────────────────────────────────────────────────────────────────────────────────
(Intercept) 0.0 NaN NaN NaN NaN NaN
Year 8.17971 0.167978 48.70 <1e-73 7.84664 8.51278
Mileage -0.0580528 0.00949846 -6.11 <1e-7 -0.0768865 -0.0392191
─────────────────────────────────────────────────────────────────────────────────
The intercept is not estimated, and the other two coefficients are not correct.
I tried R, SPSS, and Excel. All gave the same results, which differ from Julia's. The R results are posted below:
R> data <- read.csv("Car-Training.csv")
R> lm(Price ~ Year + Mileage, data = data )
Call:
lm(formula = Price ~ Year + Mileage, data = data)
R> summary( lm(Price ~ Year + Mileage, data = data ) )
Call:
lm(formula = Price ~ Year + Mileage, data = data)
Residuals:
Min 1Q Median 3Q Max
-2168 -835 -49 567 5402
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.19e+06 3.52e+05 -6.21 1.1e-08
Year 1.10e+03 1.75e+02 6.25 9.0e-09
Mileage -2.38e-02 9.84e-03 -2.42 0.017
Residual standard error: 1300 on 104 degrees of freedom
Multiple R-squared: 0.464, Adjusted R-squared: 0.454
F-statistic: 45.1 on 2 and 104 DF, p-value: 8e-15
Did I do something wrong here, or is there an issue with GLM? If there is an issue with GLM, can a new version be published as soon as possible, so that I can notify the students who are taking this exam right now?
I forgot to mention that I am using Julia 1.5.3 on Linux and 1.5.4 on Windows. Both produced the same results.
@nalimilan, could you have a quick look at the issue and confirm or reject that it is a problem of GLM?
I also started a discussion on Discourse: https://discourse.julialang.org/t/can-someone-replicate-this-glm-problem-with-linear-regression-on-your-computer/60098
rafael.guerra found that subtracting 2000 from Year makes the regression estimate correctly, and wonders whether the problem is due to a normalization error.
I now think it is more likely that the problem lies within GLM.jl.
I'm teaching a course that covers basic regression and simple forecasting. Subtracting 2000 will solve the problem, but it also means that the shift has to be accounted for when forecasting, which complicates things unnecessarily.
And this is part of a take-home exam, so students will not welcome the extra complexity.
I hope that GLM's maintainers, or someone with better Julia-fu, can fix the issue and publish a new version of GLM, so that I can simply ask students to update the package. That would be a better and simpler solution, I think.
As mentioned on Discourse, this stems from dropcollinear=true in lm. For some reason it decides that the intercept and the other variables are collinear, so it doesn't estimate the intercept term.
Oddly, it doesn't fail with dropcollinear=false, but rather gives the same (correct) results as R.
julia> data = CSV.File("Car-Training.csv");
julia> model = @formula( Price ~ Year + Mileage);
julia> lm(model, data, dropcollinear=true)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}}}}, Matrix{Float64}}
Price ~ 1 + Year + Mileage
Coefficients:
─────────────────────────────────────────────────────────────────────────────────
Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
─────────────────────────────────────────────────────────────────────────────────
(Intercept) 0.0 NaN NaN NaN NaN NaN
Year 8.17971 0.167978 48.70 <1e-73 7.84664 8.51278
Mileage -0.0580528 0.00949846 -6.11 <1e-07 -0.0768865 -0.0392191
─────────────────────────────────────────────────────────────────────────────────
julia> lm(model, data, dropcollinear=false)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.Cholesky{Float64, Matrix{Float64}}}}, Matrix{Float64}}
Price ~ 1 + Year + Mileage
Coefficients:
────────────────────────────────────────────────────────────────────────────────────
Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
────────────────────────────────────────────────────────────────────────────────────
(Intercept) -2.18759e6 3.52442e5 -6.21 <1e-07 -2.8865e6 -1.48868e6
Year 1096.15 175.282 6.25 <1e-08 748.558 1443.74
Mileage -0.0238368 0.00984145 -2.42 0.0172 -0.0433527 -0.00432081
────────────────────────────────────────────────────────────────────────────────────
Thank you. This indeed solved the issue. I will prepare a note for the students so that they are aware.
As I commented on Discourse, I'm not sure it is sane to set dropcollinear=true as the default. I have used a number of statistical packages, and this is the first time I have seen such an option be the default.
As you mentioned on Discourse, Stata implements the equivalent of dropcollinear=true, so my previous comment was invalid.
The cause is this line: https://github.com/JuliaStats/GLM.jl/blob/16268be9eee54769b72ac36d17e5fd38337c0aae/src/linpred.jl#L107
I don't understand the Cholesky factorization well enough. It's odd that cholesky! with Val(true) doesn't return roughly the same output as cholesky! when the input matrix has full rank. Should we check the rank ourselves and choose cholesky!(F) when the matrix actually has full rank, even if dropcollinear is true?
cc @andreasnoack
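For concreteness, one hypothetical shape of that idea (a sketch only, not a proposed patch and not GLM.jl's current code) could be: compute the rank first, and only take the pivoted, rank-revealing path when the matrix really is rank deficient. The function name choose_cholesky is made up, and the extra rank() call is an SVD, so it is not free.
using LinearAlgebra

function choose_cholesky(XtX::Hermitian{T}) where {T<:Real}
    if rank(XtX) == size(XtX, 1)
        # full rank: plain Cholesky
        return cholesky(XtX)
    else
        # rank deficient: pivoted Cholesky, as GLM does for dropcollinear=true
        return cholesky(XtX, Val(true); tol = -one(T), check = false)
    end
end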
Just curious: how do you debug in Julia? Is there a trace function, or do you just print a lot of intermediate results to see where things went wrong?
In this instance I had a feeling and knew the library well enough to know what functions were being called.
But for debugging I generally use Exfiltrator.jl and then lots of printing and "break points", i.e. an asdf dropped somewhere in the code so execution errors out at that point. I'm not a very sophisticated debugger.
Thank you!
I'd like to try to help, but I don't fully understand the code. Could you rewrite F = pivot ? cholesky!(F, Val(true), tol = -one(T), check = false) : cholesky!(F) into several lines here?
cholesky!(F, Val(true), tol = -one(T), check = false) looks like a call to a function cholesky() that takes four parameters, except that a ! is added after cholesky. I don't remember what ! does in Julia, but I think I can find the answer to that one myself.
What does the : cholesky!(F) part do? In particular, what does : mean here?
And what does pivot ? do, right after the = sign?
There are two distinct misunderstandings here.
The first is Julia syntax. The syntax
x = a ? b : c
is the same as
x = begin
    if a == true
        b
    else
        c
    end
end
so in this case pivot is a boolean.
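Applied to the line you quoted, the same ternary written out over several lines would be (purely a reading aid):
if pivot
    # pivoted (rank-revealing) Cholesky; a negative tol means "use the LAPACK default"
    F = cholesky!(F, Val(true), tol = -one(T), check = false)
else
    # plain Cholesky, which assumes the matrix has full rank
    F = cholesky!(F)
end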
The ! in cholesky! is not syntax, but a naming convention: we name a function to end in ! if it modifies its arguments, so cholesky!(F, ...) means the function overwrites F.
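A tiny illustration of that naming convention, using sort rather than a factorization:
v = [3, 1, 2]
sort(v)     # returns a new sorted vector; v is still [3, 1, 2]
sort!(v)    # sorts v in place; v is now [1, 2, 3]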
The next misunderstanding is what's going on mathematically. To better understand why we call cholesky, check out this short explainer.
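Roughly, the dense-predictor path solves the ordinary least squares problem through the normal equations, and Cholesky is the factorization used to do that. A minimal hand-rolled sketch (not GLM.jl's actual code), reusing results and data from the first post:
using LinearAlgebra

# Solve X'X * beta = X'y by factorizing X'X; an ill-conditioned X'X
# (e.g. a Year column with values around 2009) is where this gets fragile.
X = results.mm.m              # model matrix of the fitted model above
y = data.Price
C = cholesky(Hermitian(X' * X))
beta = C \ (X' * y)           # should be close to coef(results) when the fit succeeded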
It seems that the default tolerance (tol=-1) is too large, so the decomposition fails:
julia> X = results.mm.m;
julia> F = Hermitian(float(X'X));
julia> cholesky(F, Val(true), tol=1e-5, check=true)
CholeskyPivoted{Float64, Matrix{Float64}}
U factor with rank 3:
3×3 UpperTriangular{Float64, Matrix{Float64}}:
3.6764e5 18721.7 9.31674
⋅ 9036.63 4.49425
⋅ ⋅ 0.00369674
permutation:
3-element Vector{Int64}:
3
2
1
julia> cholesky(F, Val(true), tol=1e-4, check=true)
ERROR: RankDeficientException(1)
[...]
julia> cholesky(F, Val(true), tol=-1, check=true)
ERROR: RankDeficientException(1)
[...]
The default tolerance is defined like this:
TOL is DOUBLE PRECISION
User defined tolerance. If TOL < 0, then N*U*MAX( A(K,K) )
will be used. The algorithm terminates at the (K-1)st step
if the pivot <= TOL.
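For reference, that LAPACK default can be computed for the F = Hermitian(float(X'X)) from the snippet above (a sketch; the actual value depends on the data):
# N * U * MAX(A(K,K)) from the LAPACK docs
N = size(F, 1)                       # number of model-matrix columns (3 here)
U = eps(Float64)                     # unit roundoff for Float64
lapack_default_tol = N * U * maximum(diag(F))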
R uses tol = 1e-7 (see here). But R isn't calling LAPACK; rather, it has its own decomposition code defined in lm.c.
That code seems to use the QR decomposition, so do you think we can compare the tolerance?
That's a good point. Let's try fixest, which uses Cholesky. They define their own Cholesky factorization here, but their tol keyword argument works differently than Julia's.
EDIT: I did this wrong. There is some scaling going on, so the keyword argument passed to feols isn't the same as the one passed to cpp_cholesky.
I wonder if it's worth pinging Laurent on this. He's been saying he's doing more things in Julia.
Thank you.
While I am not able to fully follow the technical discussion, I am grateful to see the progress.
I think part of the issue is that the Year variable in my exam dataset is nearly invariant: the values are very close to the mean and vary by only about 1% of it. This raises the question of how small tol should be.
Below I create a new variable that follows a normal distribution, with a mean larger than Year's and a standard deviation that is even smaller relative to the mean.
using GLM
using DataFrames
using Distributions
using CSV
data = CSV.read( "Car-Training.csv", DataFrame )
sample_size = length( data.Year )
data.NewVar = rand( Normal( 100000, 1 ), sample_size )
model = @formula( Price ~ NewVar + Mileage )
results = lm( model, data )
results = lm( model, data, dropcollinear = false )
Could you test whether an even smaller tol is required to correctly estimate the model?
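(For anyone who wants to check: one way to probe this directly is to run the pivoted factorization on the new design matrix at a few tolerances. A sketch, reusing the results object from the code above; the exact cutoff will differ between machines.)
using LinearAlgebra

X = results.mm.m                     # model matrix for Price ~ NewVar + Mileage
XtX = Hermitian(float(X' * X))
for tol in (1e-3, 1e-5, 1e-7, 1e-9)
    C = cholesky(XtX, Val(true), tol = tol, check = false)
    println("tol = ", tol, "  computed rank = ", C.rank)
end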
Is tol a parameter to the lm() function? If specifying tol turns out to be necessary for the algorithm to work in special cases, it might be a good idea to let users pass tol to lm().
What would be the practical implication of temporarily changing the tol value from the current default (-1) to 1e-5 or 1e-7?
I am wondering if it is possible to release a new version of GLM.jl with tol set to, for example, 1e-7. If so, I can simply ask the students to update GLM.jl for the take-home exam they are working on, which uses the dataset posted here. The benefit would be minimal disturbance to the exam, while a more proper value of tol, or a better way of handling this issue, could be addressed in a future release of GLM.jl.
While changing the tol option to a smaller default, or exposing it as a keyword argument, are likely solutions to this problem, neither is going to happen in a release in time for your students' exam.
I would suggest creating a variable of years since 2008, i.e. df.Year = df.Year .- minimum(df.Year).
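Spelled out, that workaround looks roughly like this (a sketch using the same Car-Training.csv as above):
using CSV, DataFrames, GLM

df = CSV.read("Car-Training.csv", DataFrame)
df.Year = df.Year .- minimum(df.Year)        # years since the earliest model year in the data
shifted_fit = lm(@formula(Price ~ Year + Mileage), df)
# Shifting Year by a constant only changes the meaning of the intercept
# (price at the earliest year and zero mileage); the slope estimates are unaffected.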
This is a bummer of a bug, though. It's unfortunate that we are uncovering this behavior, but thank you for bringing it to our attention.
Digging into the git blame on this, it looks like @dmbates originally decided the tolerance for cholesky! and set it equal to -1, so he is also a good person to ask about that number in particular.
@pdeffebach That threshold stems from the days when that was the threshold in LinearAlgebra for the QR and Cholesky decompositions (cf. this discussion).
Changing the tolerance won't always fix the problem here though because if you're that ill conditioned, then you may or may not hit the threshold on different architectures. (Floating point math isn't associative so e.g. using AVX and other SIMD instructions can change the result. This led to intermittent test failures in MixedModels depending on whether or not the test was running on a machine with AVX512 support, which is what motivated the discussion that I linked.) As mentioned in #375, the extra precision from the QR decomposition can sometimes help, albeit at a slight performance penalty.
But the best choice here is an affine transformation of Year so that all the variables are closer in scale and the coefficients are more interpretable (e.g. the intercept is not a measurement of "year 0", which is well before there were cars). Understanding that aspect -- interpretability of model coefficients as well as the existence of numeric issues in practice -- is something that I think is also very important for an applied statistics course.
All that said, I think we should probably
- expose the tolerances
- make it easier to choose the QR or Cholesky routes
- add to the documentation why having two algorithms is important
I agree with @palday here. In particular, I believe that age is a more meaningful variable here. It doesn't change the general issue though. Numerical rank determination is not clear-cut.
I just want to mention that the QR wouldn't make a difference for this particular problem
julia> cond(cholesky(Symmetric(X'X), Val(true)).U)
9.957869521461289e7
julia> cond(qr(X, Val(true)).R)
9.957869522989982e7
Notice that both correspond to reduced rank with respect to the R tolerance of 1e-7. The main difference is the pivoting strategy (and rank determination) used by the custom pivoted QR algorithm in R. I'm not aware of a numerical analysis of the algorithm used by R; it would be interesting to see such an analysis.
The numerical concerns are important, but I'm not super happy with a situation where lm fails in Julia when it doesn't fail in R. I'm sure there are edge cases where it fails in R as well, and we should aim for our failures to be a subset of their failures.
Could we change the tolerance to 1e-5? There is an instability-robustness tradeoff, and I'm not sure at exactly which parameter value it really kicks in.
This problem was caused by me making dropcollinear=true the default in #386. I think it's important that this problem does not occur when dropcollinear=false. This happens because cholesky (the method called when dropcollinear=false) calls cholesky!(H, Val(true), tol = 0.0, check = false). We were already enforcing a very strict tolerance before that change and things were working okay.
Is there some other heuristic we can use to determine rank deficiency outside of Cholesky?
Of course, we can always revert the dropcollinear=true default. The goal of that PR was to make GLM easier for newcomers, but explaining rank deficiency is easier than explaining numerical errors in matrix decompositions.
@pdeffebach The numerical issues won't go away. Poorly conditioned models will often have misleading standard errors and other pathologies. I'm pretty strongly opposed to making things "too easy" because there will always be cases where these things come to the surface. Model diagnostics (does my model seem to actually fit my data? did something go wrong in fitting the model?) and centering and scaling covariates should be applied stats 101. Users don't have to know how our internal numerics work, but they should understand that numerical issues can and do arise in real-world statistical problems and that there are ways to check for false convergence (diagnostics) and ways to help the computer (centering and scaling).
Note that in your historical counterexample there's a problem: check=false disables the error checking and just assumes that the decomposition worked at the given tolerance.
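For example, on the F = Hermitian(float(X'X)) from earlier in the thread (a sketch; the reported rank may vary slightly across machines):
C = cholesky(F, Val(true), tol = -1.0, check = false)   # does not throw, unlike check=true above
C.rank   # likely 2 rather than 3 here: the intercept column is treated as collinear,
         # matching the GLM output at the top of the issue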
Is there some other heuristic we can use to determine rank deficiency outside of Cholesky?
- QR with pivoting, already discussed above.
- The singular value decomposition, but this is expensive.
- The condition number, to guess whether a given computation is likely to be ill-conditioned; but that is also a non-trivial extra operation, and it doesn't tell us what to do, only that whatever we do, there might be problems.
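For concreteness, here is roughly what those three checks look like on the model matrix from this issue (a sketch; X is results.mm.m as above, and the 1e-7 threshold is just R's choice, not a recommendation):
using LinearAlgebra

X = results.mm.m                             # model matrix for Price ~ Year + Mileage

# 1. Pivoted QR: count diagonal entries of R above a relative threshold
R = qr(X, Val(true)).R
qr_rank = count(abs.(diag(R)) .>= 1e-7 * abs(R[1, 1]))

# 2. SVD: count singular values above a relative threshold (this is what rank() does)
svd_rank = rank(X)

# 3. Condition number: a single diagnostic; it flags likely trouble but does not
#    say which columns to drop
kappa = cond(X)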
We have a whole page explaining how numerical rank deficiency crops up in mixed models and how rank deficiency is well defined in theory but really, really challenging in practice: https://juliastats.org/MixedModels.jl/stable/rankdeficiency/
It might be a good idea to benchmark the singular value decomposition. As an iterative method it is definitely more expensive than direct methods like Cholesky and QR but some of the recent SVD methods are surprisingly fast. The point is that the condition number is a property of the singular values. @andreasnoack may be able to offer more informed commentary on the SVD methods that are available. I haven't really kept up.
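A rough way to compare the costs (a sketch with BenchmarkTools on a made-up design of a plausible regression size; absolute timings will depend on problem size and hardware):
using BenchmarkTools, LinearAlgebra

n, p = 10_000, 20
A = randn(n, p)
AtA = Hermitian(A' * A)

@btime svdvals($A)                               # SVD route, on the n×p matrix
@btime qr($A, Val(true))                         # pivoted QR, on the n×p matrix
@btime cholesky($AtA, Val(true), check = false)  # pivoted Cholesky, on the p×p crossproduct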
@andreasnoack I'm not equipped to do a full numerical stability analysis of R's algorithm, but here is a full MWE for you.
R (from fixest) here:
library(tidyverse)
library(fixest)
cpp_cholesky = utils::getFromNamespace("cpp_cholesky", "fixest");
df = read_csv("https://github.com/JuliaStats/GLM.jl/files/6384056/Car-Training.csv")
m = feols(Price ~ Year + Mileage, df)
X = model.matrix(m)
XtX = t(X) %*% X
cpp_cholesky(XtX, tol=80.0) # full rank
cpp_cholesky(XtX, tol=81.0) # not full rank
and Julia
using CSV, HTTP, GLM, LinearAlgebra
url = "https://github.com/JuliaStats/GLM.jl/files/6384056/Car-Training.csv";
df = CSV.File(HTTP.get(url).body);
m = lm(@formula(Price ~ Year + Mileage), df);
X = m.mm.m;
XtX = Hermitian(X' * X);
cholesky(XtX, Val(true), tol = 1e-5, check = true) # full rank
cholesky(XtX, Val(true), tol = 1e-4, check = true) # not full rank
N = size(XtX, 1)
lapack_tol = N * eps(Float64) * maximum(diag(XtX))
I don't understand what fixest's tol is doing. It's odd that the default tolerance is 1e-10 but the invertibility change happens at a value near 80.
@pdeffebach It appears that your R code is not using the Cholesky factorization in R but a specific method, which from the name I expect is coded in C++, called cpp_cholesky in the fixest package.
Yeah, I know. It felt like a decent benchmark because fixest uses Cholesky by default, unlike base R.
The code for that method starts around line 130 of https://github.com/lrberge/fixest/blob/master/src/lm_related.cpp
The linear algebra code we are using in Julia is LAPACK/OpenBLAS or MKL - code that has been developed by the foremost leaders in the field of numerical linear algebra and tested for over 40 years. I think when someone codes up their own C++ implementation it falls more on them to reconcile differences in results.