
Incorrect linear regression results

xianwenchen opened this issue on Apr 27 '21 · 50 comments

Car-Training.csv

This is a bit urgent.

In an ongoing take-home exam, students need to estimate models using the attached dataset.

Here is the Julia code:

using GLM
using DataFrames
using CSV

data = CSV.read( "Car-Training.csv", DataFrame )
model = @formula( Price ~ Year + Mileage )
results = lm( model, data )

The output is the following:

StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,LinearAlgebra.CholeskyPivoted{Float64,Array{Float64,2}}}},Array{Float64,2}}

Price ~ 1 + Year + Mileage

Coefficients:
─────────────────────────────────────────────────────────────────────────────────
                  Coef.    Std. Error       t  Pr(>|t|)    Lower 95%    Upper 95%
─────────────────────────────────────────────────────────────────────────────────
(Intercept)   0.0        NaN           NaN       NaN     NaN          NaN
Year          8.17971      0.167978     48.70    <1e-73    7.84664      8.51278
Mileage      -0.0580528    0.00949846   -6.11    <1e-7    -0.0768865   -0.0392191
─────────────────────────────────────────────────────────────────────────────────

The intercept is not estimated, and the other two coefficients are incorrect.

I tried R, SPSS, and Excel. All gave the same results, which differ from Julia's. I post the results from R below:

R> data <- read.csv("Car-Training.csv")
R> lm(Price ~ Year + Mileage, data = data )

Call:
lm(formula = Price ~ Year + Mileage, data = data)

R> summary( lm(Price ~ Year + Mileage, data = data ) )

Call:
lm(formula = Price ~ Year + Mileage, data = data)

Residuals:
   Min     1Q Median     3Q    Max 
 -2168   -835    -49    567   5402 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.19e+06   3.52e+05   -6.21  1.1e-08
Year         1.10e+03   1.75e+02    6.25  9.0e-09
Mileage     -2.38e-02   9.84e-03   -2.42    0.017

Residual standard error: 1300 on 104 degrees of freedom
Multiple R-squared:  0.464,     Adjusted R-squared:  0.454 
F-statistic: 45.1 on 2 and 104 DF,  p-value: 8e-15

Did I do something wrong here? Or is there an issue with GLM? If there is an issue with GLM, can a new version be published as soon as possible, so that I can notify the students who are taking this exam right now?

xianwenchen avatar Apr 27 '21 11:04 xianwenchen

I forgot to say that I am using Julia 1.5.3 on Linux and 1.5.4 on Windows. Both produced the same results.

xianwenchen avatar Apr 27 '21 11:04 xianwenchen

@nalimilan, could you have a quick look at the issue and confirm or reject that it is a problem of GLM?

xianwenchen avatar Apr 27 '21 11:04 xianwenchen

I also started a discussion on Discourse: https://discourse.julialang.org/t/can-someone-replicate-this-glm-problem-with-linear-regression-on-your-computer/60098

rafael.guerra found out that by subtracting 2000 from Year, the regression was estimated correctly. He wonders if the problem was due to a normalization error.
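For reference, here is a minimal sketch of that workaround, reusing the data from my first post (Year2000 is just an illustrative name):

using GLM
using DataFrames
using CSV

data = CSV.read( "Car-Training.csv", DataFrame )

# Recenter Year so the design-matrix columns are on comparable scales.
data.Year2000 = data.Year .- 2000
results = lm( @formula( Price ~ Year2000 + Mileage ), data )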

I now think it is more likely that the problem lies within GLM.jl.

I'm teaching a course that uses basic regression for simple forecasting. Subtracting 2000 will solve the problem, but it also means that one needs to account for the shift when forecasting, which unnecessarily complicates the issue.

And this is part of a take-home exam, so the extra complexity will not be very welcome to students.

I hope that GLM's maintainers, or someone with better Julia-fu, can fix the issue and publish a newer version of GLM, so that I can simply ask students to update the package. That would be a better and simpler solution, I think.

xianwenchen avatar Apr 27 '21 12:04 xianwenchen

As mentioned on Discourse, this stems from dropcollinear=true in lm. For some reason it decides that the intercept and the other variables are collinear, so it doesn't estimate the intercept term.

Oddly, it doesn't fail with dropcollinear=false, but rather gives the same (correct) results as in R.

julia> data = CSV.File("Car-Training.csv");

julia> model = @formula( Price ~ Year + Mileage);

julia> lm(model, data, dropcollinear=true)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}}}}, Matrix{Float64}}

Price ~ 1 + Year + Mileage

Coefficients:
─────────────────────────────────────────────────────────────────────────────────
                  Coef.    Std. Error       t  Pr(>|t|)    Lower 95%    Upper 95%
─────────────────────────────────────────────────────────────────────────────────
(Intercept)   0.0        NaN           NaN       NaN     NaN          NaN
Year          8.17971      0.167978     48.70    <1e-73    7.84664      8.51278
Mileage      -0.0580528    0.00949846   -6.11    <1e-07   -0.0768865   -0.0392191
─────────────────────────────────────────────────────────────────────────────────

julia> lm(model, data, dropcollinear=false)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.Cholesky{Float64, Matrix{Float64}}}}, Matrix{Float64}}

Price ~ 1 + Year + Mileage

Coefficients:
────────────────────────────────────────────────────────────────────────────────────
                    Coef.    Std. Error      t  Pr(>|t|)    Lower 95%      Upper 95%
────────────────────────────────────────────────────────────────────────────────────
(Intercept)    -2.18759e6    3.52442e5   -6.21    <1e-07   -2.8865e6     -1.48868e6
Year         1096.15       175.282        6.25    <1e-08  748.558      1443.74
Mileage        -0.0238368    0.00984145  -2.42    0.0172   -0.0433527    -0.00432081
────────────────────────────────────────────────────────────────────────────────────

pdeffebach avatar Apr 27 '21 13:04 pdeffebach

Thank you. This indeed solved the issue. I will prepare a note for students so that they are aware.

As I commented on Discourse, I'm not sure it is sane to set dropcollinear=true as the default. I have used a number of statistical packages, and this is the first time I have seen such an option as the default.

xianwenchen avatar Apr 27 '21 13:04 xianwenchen

As you have mentioned on Discourse, Stata implements dropcollinear=true. This means that my previous comment was invalid.

xianwenchen avatar Apr 27 '21 13:04 xianwenchen

The cause is this line: https://github.com/JuliaStats/GLM.jl/blob/16268be9eee54769b72ac36d17e5fd38337c0aae/src/linpred.jl#L107

I don't understand Cholesky factorization well enough. It's odd that cholesky! with Val(true) doesn't return roughly the same output as plain cholesky! when the input matrix is of full rank. Should we check the rank ourselves and choose cholesky!(F) when the matrix turns out to be full rank?
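For reference, on a well-conditioned full-rank matrix the two routes do agree once the pivoting is undone. A toy check (the matrix here is made up):

using LinearAlgebra

A = [4.0 2.0; 2.0 3.0]        # small, well-conditioned SPD matrix
C1 = cholesky(A)              # plain Cholesky
C2 = cholesky(A, Val(true))   # pivoted Cholesky
C1.U' * C1.U ≈ A              # true
C2.U' * C2.U ≈ A[C2.p, C2.p]  # true: same factor up to the pivot permutation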

cc @andreasnoack

pdeffebach avatar Apr 27 '21 13:04 pdeffebach

Just curious: how do you debug in Julia? Is there a trace function, or do you just print a lot of intermediate results to see where things went wrong?

xianwenchen avatar Apr 27 '21 13:04 xianwenchen

In this instance I had a feeling and knew the library well enough to know what functions were being called.

But for debugging I generally use Exfiltrator.jl, plus lots of printing and "break points", i.e. an undefined asdf placed somewhere in the code so execution errors out at that point. I'm not a very sophisticated debugger.

pdeffebach avatar Apr 27 '21 13:04 pdeffebach

Thank you!

xianwenchen avatar Apr 27 '21 14:04 xianwenchen

I'd like to try to help, but I don't fully understand the code. Could you rewrite F = pivot ? cholesky!(F, Val(true), tol = -one(T), check = false) : cholesky!(F) as several lines here?

cholesky!(F, Val(true), tol = -one(T), check = false) seems to be a call to a function named cholesky that takes four arguments.

A ! is appended to cholesky. I don't remember what ! does in Julia, but I think I can find the answer to that one myself.

What does the : cholesky!(F) part do? In particular, what does : mean here?

And what does the pivot ? do, right after the = sign?

xianwenchen avatar Apr 27 '21 14:04 xianwenchen

There are two distinct misunderstandings here.

The first is Julia syntax. The syntax

x = a ? b : c

is the same as

x = begin 
    if a == true 
        b
    else
        c
    end
end

so in this case pivot is a boolean.

The ! in cholesky! is not syntax, but rather a naming convention: we name a function ending in ! if it modifies its input. cholesky!(F, ...) means the function overwrites F.
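So the line from linpred.jl, written out without the ternary, is equivalent to:

if pivot
    F = cholesky!(F, Val(true), tol = -one(T), check = false)
else
    F = cholesky!(F)
end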

The second misunderstanding concerns what's going on mathematically. To better understand why we call cholesky, check out this short explainer.
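The gist, glossing over GLM.jl's actual internals: lm solves the least-squares problem through the normal equations X'X β = X'y, and a Cholesky factorization of X'X is a fast way to do that. A hand-rolled sketch, where X is the design matrix and y the response vector:

using LinearAlgebra

# Solve min ‖y - X*β‖ via the normal equations X'X * β = X'y.
beta = cholesky(Symmetric(X' * X)) \ (X' * y)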

pdeffebach avatar Apr 27 '21 14:04 pdeffebach

It seems that the default tolerance (tol=-1) is too large so the decomposition fails:

julia> X = results.mm.m;

julia> F = Hermitian(float(X'X));

julia> cholesky(F, Val(true), tol=1e-5, check=true)
CholeskyPivoted{Float64, Matrix{Float64}}
U factor with rank 3:
3×3 UpperTriangular{Float64, Matrix{Float64}}:
 3.6764e5  18721.7   9.31674
  ⋅         9036.63  4.49425
  ⋅             ⋅    0.00369674
permutation:
3-element Vector{Int64}:
 3
 2
 1

julia> cholesky(F, Val(true), tol=1e-4, check=true)
ERROR: RankDeficientException(1)
[...]

julia> cholesky(F, Val(true), tol=-1, check=true)
ERROR: RankDeficientException(1)
[...]

The default tolerance is defined like this (from the LAPACK documentation):

          TOL is DOUBLE PRECISION
          User defined tolerance. If TOL < 0, then N*U*MAX( A(K,K) )
          will be used. The algorithm terminates at the (K-1)st step
          if the pivot <= TOL.
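In Julia terms, that default works out to the following (reusing the F from above; U is the unit roundoff, i.e. eps(Float64)):

n = size(F, 1)
default_tol = n * eps(Float64) * maximum(diag(F))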

nalimilan avatar Apr 27 '21 15:04 nalimilan

R uses tol = 1e-7. See here.

But they aren't calling LAPACK. Rather, they have their own decomposition code defined in lm.c.

pdeffebach avatar Apr 27 '21 16:04 pdeffebach

That code seems to use the QR decomposition, so do you think we can compare the tolerance?

nalimilan avatar Apr 27 '21 16:04 nalimilan

That's a good point. Let's try fixest, which uses Cholesky.

They define their own cholesky factorization here.

But their tol keyword argument works differently than Julia's.

EDIT: I did this wrong. 

There's some sort of scaling that goes on such that the keyword argument passed to feols isn't the same as the one passed to cpp_cholesky.

I wonder if it's worth pinging Laurent on this. He's been saying he's doing more things in Julia.

pdeffebach avatar Apr 27 '21 17:04 pdeffebach

Thank you.

While I am not able to fully follow the technical discussion, I am grateful to see the progress.

I think part of the issue comes from the fact that the Year variable in my exam dataset is nearly invariant: the values are very close to the mean and do not vary by more than around 1% of it.

This raises the question of how small tol should be.

Below I create a new variable that follows a normal distribution, with a larger mean than Year and an even smaller deviation relative to that mean.

using GLM
using DataFrames
using Distributions
using CSV

data = CSV.read( "Car-Training.csv", DataFrame )

sample_size = length( data.Year )

data.NewVar = rand( Normal( 100000, 1 ), sample_size )

model = @formula( Price ~ NewVar + Mileage )
results = lm( model, data )
results = lm( model, data, dropcollinear = false )

Could you test whether an even smaller tol is required to correctly estimate the model?

Is tol a parameter of the lm() function? If specifying tol turns out to be necessary to make the algorithm work in special cases, it may be a good idea to let users pass tol to lm().

What would be the practical implication of temporarily changing the tol value from -1 to 1e-5 or 1e-7?

I am wondering if it is possible to release a new version of GLM.jl with tol set to, for example, 1e-7. If so, I can ask students to simply update GLM.jl for the take-home exam they are working on, which uses the dataset I posted here. The benefit would be minimal disturbance to the exam, while a more proper tol value, or a more proper way of handling this issue, could be addressed in a future release of GLM.jl.

xianwenchen avatar Apr 27 '21 19:04 xianwenchen

While changing the tol option to a smaller default, or allowing it as a keyword argument, are likely solutions to this problem, that is not going to happen in a release in time for your students' exam.

I would suggest creating a variable of years since 2008, i.e. df.Year = df.Year .- minimum(df.Year).

This is a bummer of a bug, though. It's unfortunate that we are uncovering this behavior, but thank you for bringing it to our attention.

pdeffebach avatar Apr 27 '21 19:04 pdeffebach

Digging into the git blame on this, it looks like @dmbates originally chose the tolerance for cholesky! and set it equal to -1, so he is also a good person to ask about that number in particular.

pdeffebach avatar Apr 28 '21 22:04 pdeffebach

@pdeffebach That threshold stems from the days when that was the threshold in LinearAlgebra for the QR and Cholesky decompositions (cf this discussion).

Changing the tolerance won't always fix the problem here, though, because if you're that ill-conditioned, then you may or may not hit the threshold on different architectures. (Floating-point math isn't associative, so e.g. using AVX and other SIMD instructions can change the result. This led to intermittent test failures in MixedModels depending on whether or not the test was running on a machine with AVX512 support, which is what motivated the discussion I linked.) As mentioned in #375, the extra precision from the QR decomposition can sometimes help, albeit at a slight performance penalty.

But the best choice here is an affine transformation of Year so that all the variables are closer in scale and the coefficients are more interpretable (e.g. the intercept is not a measurement of "year 0", which is well before there were cars). Understanding that aspect -- interpretability of model coefficients as well as the existence of numeric issues in practice -- is something that I think is also very important for an applied statistics course.
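For example, a sketch of such a transformation on the dataset from the original post (the reference year is arbitrary):

# Age of the car instead of model year: the intercept now describes a
# zero-year-old car rather than a car from "year 0".
data.Age = 2021 .- data.Year
lm(@formula(Price ~ Age + Mileage), data)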

palday avatar Apr 28 '21 22:04 palday

All that said, I think we should probably

  • expose the tolerances
  • make it easier to choose the QR or Cholesky routes
  • add to the documentation why having two algorithms is important

palday avatar Apr 28 '21 22:04 palday

I agree with @palday. In particular, I believe that age is a more meaningful variable here. That doesn't change the general issue, though: numerical rank determination is not clear-cut.

I just want to mention that QR wouldn't make a difference for this particular problem:

julia> cond(cholesky(Symmetric(X'X), Val(true)).U)
9.957869521461289e7

julia> cond(qr(X, Val(true)).R)
9.957869522989982e7

Notice that both correspond to reduced rank with respect to R's tolerance of 1e-7. The main difference is the pivoting strategy (and rank determination) used by the custom pivoted QR algorithm in R. I'm not aware of a numerical analysis of that algorithm; it would be interesting to see one.

andreasnoack avatar Apr 29 '21 06:04 andreasnoack

The numerical concerns are important, but I'm not super happy with a situation where lm fails in Julia when it doesn't fail in R. I'm sure there are edge cases where it fails in R as well, and we should aim for our failures to be a subset of theirs.

Could we change the tolerance to 1e-5? There is an instability-robustness tradeoff, and I'm not sure exactly at which parameter value it really kicks in.

This problem was caused by me making dropcollinear=true the default in #386. I think it's important that this problem does not occur when dropcollinear=false. That happens because cholesky (the method called when dropcollinear=false) calls cholesky!(H, Val(true), tol = 0.0, check = false). We were already enforcing a very strict tolerance before that change, and things were working okay.

Is there some other heuristic we can use to determine rank deficiency outside of Cholesky?

Of course, we can always revert dropcollinear=true. The goal of that PR was to make GLM easier for newcomers, but explaining rank deficiency is easier than explaining numerical errors in matrix decompositions.

pdeffebach avatar Apr 29 '21 13:04 pdeffebach

@pdeffebach The numerical issues won't go away. Poorly conditioned models will often have misleading standard errors and other pathologies. I'm pretty strongly opposed to making things "too easy", because there will always be cases where these things come to the surface. Model diagnostics (does my model seem to actually fit my data? did something go wrong in fitting the model?) and centering and scaling covariates should be applied stats 101. Users don't have to know how our internal numerics work, but they should understand that numerical issues can and do arise in real-world statistical problems, and that there are ways to check for false convergence (diagnostics) and ways to help the computer (centering and scaling).

Note that in your historical counterexample there's a problem: check=false. That setting disables the error check and just assumes that the decomposition worked at the given tolerance.

palday avatar Apr 29 '21 13:04 palday

Is there some other heuristic we can use to determine rank deficiency outside of Cholesky?

  • QR with pivoting, already discussed above.
  • The singular value decomposition, but this is expensive.
  • We could look at the condition number to guess whether a given calculation is likely to be ill-conditioned, but that's also an extra non-trivial operation, and it doesn't tell us what to do, only that whatever we do, there might be problems.

We have a whole page explaining how numerical rank deficiency crops up in mixed models and how rank deficiency is well defined in theory but really, really challenging in practice: https://juliastats.org/MixedModels.jl/stable/rankdeficiency/
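For concreteness, here is a sketch of those alternatives, assuming X is the model matrix from the comments above (the 1e-7 cutoff is only illustrative):

using LinearAlgebra

rank(X)                  # SVD-based: counts singular values above a tolerance

R = qr(X, Val(true)).R   # pivoted QR: count the "large" diagonal entries of R
count(abs.(diag(R)) .> 1e-7 * abs(R[1, 1]))

cond(X)                  # condition number: large values warn of ill-conditioning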

palday avatar Apr 29 '21 13:04 palday

It might be a good idea to benchmark the singular value decomposition. As an iterative method it is definitely more expensive than direct methods like Cholesky and QR, but some of the recent SVD methods are surprisingly fast. The point is that the condition number is a property of the singular values. @andreasnoack may be able to offer more informed commentary on the SVD methods that are available; I haven't really kept up.
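Something along these lines would do as a first pass (a sketch using BenchmarkTools, again with X as the model matrix; relative timings will depend on the problem size):

using BenchmarkTools, LinearAlgebra

XtX = Symmetric(X' * X)
@btime svd($X);                                  # singular value decomposition
@btime qr($X, Val(true));                        # pivoted QR
@btime cholesky($XtX, Val(true), check = false); # pivoted Cholesky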

dmbates avatar Apr 29 '21 13:04 dmbates

@andreasnoack I'm not equipped to do a full numerical stability analysis of R's algorithm, but here is a full MWE for you.

The R code (using fixest):

library(tidyverse)
library(fixest)
cpp_cholesky = utils::getFromNamespace("cpp_cholesky", "fixest"); 
df = read_csv("https://github.com/JuliaStats/GLM.jl/files/6384056/Car-Training.csv")
m = feols(Price ~ Year + Mileage, df)
X = model.matrix(m)
XtX = t(X) %*% X 
cpp_cholesky(XtX, tol=80.0) # full rank
cpp_cholesky(XtX, tol=81.0) # not full rank

and the Julia code:

using CSV, HTTP, GLM, LinearAlgebra

url = "https://github.com/JuliaStats/GLM.jl/files/6384056/Car-Training.csv";
df = CSV.File(HTTP.get(url).body);
m = lm(@formula(Price ~ Year + Mileage), df);
X = m.mm.m;
XtX = Hermitian(X' * X);
cholesky(XtX, Val(true), tol = 1e-5, check = true) # full rank
cholesky(XtX, Val(true), tol = 1e-4, check = true) # not full rank
N = size(XtX, 1)  # the matrix order, as in the LAPACK default formula
lapack_tol = N * eps(Float64) * maximum(diag(XtX))

I don't understand what fixest's tol is doing. It's odd that the default tolerance is 1e-10 but the invertibility change happens at a value near 80.

pdeffebach avatar Apr 29 '21 14:04 pdeffebach

@pdeffebach It appears that your R code is not using R's own Cholesky factorization but a specific method from the fixest package called cpp_cholesky, which from the name I expect is coded in C++.

dmbates avatar Apr 29 '21 15:04 dmbates

Yeah, I know. It felt like a decent benchmark because fixest uses Cholesky by default, unlike base R.

pdeffebach avatar Apr 29 '21 16:04 pdeffebach

The code for that method starts around line 130 of https://github.com/lrberge/fixest/blob/master/src/lm_related.cpp

The linear algebra code we are using in Julia is LAPACK/OpenBLAS or MKL - code that has been developed by the foremost leaders in the field of numerical linear algebra and tested for over 40 years. I think when someone codes up their own C++ implementation it falls more on them to reconcile differences in results.

dmbates avatar Apr 29 '21 16:04 dmbates