GLMNet.jl

Logistic regression fails if y is a vector of strings

Open • biona001 opened this issue 2 years ago • 1 comment

From README:

For logistic models, y is either a string vector or a m x 2 matrix

But the following doesn't work:

using GLMNet
y = ["M", "B", "M", "B"]
X = rand(4, 10)
glmnet(X, y, Binomial())

MethodError: no method matching glmnet(::Matrix{Float64}, ::Vector{String}, ::Binomial{Float64})
Closest candidates are:
  glmnet(::AbstractMatrix{T} where T, ::AbstractVector{T} where T, ::AbstractVector{T} where T) at /home/users/bbchu/.julia/packages/GLMNet/C8WKF/src/CoxNet.jl:151
  glmnet(::AbstractMatrix{T} where T, ::AbstractVector{T} where T, ::AbstractVector{T} where T, ::CoxPH; kw...) at /home/users/bbchu/.julia/packages/GLMNet/C8WKF/src/CoxNet.jl:151
  glmnet(::Matrix{Float64}, ::Vector{Float64}, ::Distribution; kw...) at /home/users/bbchu/.julia/packages/GLMNet/C8WKF/src/GLMNet.jl:485
  ...

Fortunately, if y is a matrix with 2 columns, it does work:

y = [1 0; 0 1; 0 1; 1 0]
X = rand(4, 10)
glmnet(X, y, Binomial())

Logistic GLMNet Solution Path (100 solutions for 10 predictors in 833 passes):
────────────────────────────────
       df    pct_dev           λ
────────────────────────────────
  [1]   0  0.0        0.476672
  [2]   1  0.0582906  0.455006
  [3]   1  0.11166    0.434325
  [4]   1  0.160737   0.414585
  [5]   1  0.206039   0.395741
  [6]   1  0.248      0.377754
  [7]   1  0.286986   0.360585
  ...

biona001 · May 11 '22 22:05

It looks like the method that supports the string-vector input is this one:

https://github.com/JuliaStats/GLMNet.jl/blob/8eff4c4f07374c6f6f7878b16dc02e90d444e9a1/src/Multinomial.jl#L191-L203

So this works:

using GLMNet
y = ["M", "B", "M", "B"]
X = rand(4, 10)
glmnet(X, y)

It doesn't need a distribution because it chooses between Binomial and Multinomial based on the number of unique values in y. This method could probably be extended to accept a distribution, and I guess to throw an error if the distribution and y are incompatible.

At the very least, the README should be updated.
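For illustration, here is a minimal sketch of the selection rule described above: count the unique labels, pick the family from that count, and build the indicator matrix that the matrix-based method accepts. `encode_labels` is a hypothetical helper, not part of GLMNet.jl, and the sorted column order is an assumption rather than the package's documented behaviour. The linked method may be implemented differently.

```julia
# Hypothetical helper, not part of GLMNet.jl: turn string labels into an
# indicator matrix and report which family the label count implies.
function encode_labels(y::AbstractVector{<:AbstractString})
    classes = sort(unique(y))  # assumption: columns follow sorted label order
    length(classes) >= 2 ||
        throw(ArgumentError("need at least 2 distinct labels, got $(length(classes))"))
    # One row per observation, one column per class; entry is 1.0 where the label matches.
    Y = Float64[label == class for label in y, class in classes]
    family = length(classes) == 2 ? :binomial : :multinomial
    return Y, classes, family
end

Y, classes, family = encode_labels(["M", "B", "M", "B"])

# As reported above, a two-column matrix is accepted together with Binomial():
# using GLMNet
# glmnet(rand(4, 10), Y, Binomial())
```

A compatibility check of the kind suggested above could compare `family` against the distribution the caller passed and throw an `ArgumentError` on mismatch.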

JackDunnNZ · May 16 '22 16:05