GLMNet.jl

Allow GLM style specification of Bernoulli outcomes for logistic regression

Open johnmyleswhite opened this issue 11 years ago • 2 comments

I often call glm(X, y) with y as a vector of 0s and 1s, and thought it would be nice to offer the same interface for glmnet. This patch makes some quick changes to allow that. Let me know if you'd like me to handle the changes another way; I didn't spend much time thinking about the cleanest way to add this functionality.

The script below gives a basic demo of the extended functionality. I can make it into a test:

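```julia
using GLMNet
using Distributions
using LinearAlgebra  # norm
using Random         # Random.seed! (replaces the old srand)
using Test           # replaces the old Base.Test

Random.seed!(1)

# Inverse logit (logistic) function: maps a linear predictor to a probability.
invlogit(z::Real) = 1 / (1 + exp(-z))

n, p = 250_000, 2

# Simulate Bernoulli outcomes from a logistic model with known parameters.
intercept = randn()
beta = randn(p)
X = randn(n, p)
y = X * beta
for i in 1:n
    y[i] = rand(Bernoulli(invlogit(intercept + y[i])))
end

# Proposed interface: y is passed as a vector of 0s and 1s.
path = glmnet(X, y, Binomial())
# The last fit on the path has the smallest penalty, so it should be close to the truth.
@test abs(intercept - path.a0[end]) < 0.1
@test norm(beta - convert(Matrix{Float64}, path.betas)[:, end]) < 0.1
```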
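One way a 0/1 vector could be adapted to a count-based binomial fit is to expand it into a two-column matrix of per-row counts. The sketch below assumes that layout (negative counts in column 1, positive counts in column 2). `bernoulli_to_counts` is a hypothetical helper for illustration, not a function from GLMNet or from the patch.

```julia
# Hypothetical helper: expand a 0/1 outcome vector into an n×2 count matrix.
# Assumed layout: column 1 = count of negatives, column 2 = count of positives.
function bernoulli_to_counts(y::AbstractVector{<:Real})
    all(v -> v == 0 || v == 1, y) ||
        throw(ArgumentError("y must contain only 0s and 1s"))
    counts = zeros(Float64, length(y), 2)
    for (i, v) in enumerate(y)
        # Each observation contributes a single count to exactly one column.
        counts[i, v == 1 ? 2 : 1] = 1.0
    end
    return counts
end

counts = bernoulli_to_counts([0.0, 1.0, 1.0])
```

Each row of the result sums to 1, so every observation is treated as one Bernoulli trial.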

johnmyleswhite, Jan 16 '14 04:01

I agree that this is an API worth having. I had thought about this previously, but got bogged down in implementation. If X has a lot of duplicate rows, then I think model fitting would be faster if we pooled the duplicate rows before calling lognet. This boils down to finding the unique rows of X. Doing this without allocating for each row seems possible, but it required more code than I was prepared to write at the time. Let's start with this approach and we can worry about efficiency later.
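The pooling idea can be sketched as follows. This is a minimal illustration, not the approach taken in the patch: `pool_duplicate_rows` is a hypothetical helper, the count layout (negatives in column 1, positives in column 2) is an assumption, and it deliberately takes the simple route of allocating a copy of each row, which is exactly the cost an allocation-free version would avoid.

```julia
# Hypothetical helper: pool duplicate rows of X and tally 0/1 outcomes per unique row.
# Returns (Xu, counts); counts[:, 1] holds negatives and counts[:, 2] positives (assumed layout).
function pool_duplicate_rows(X::Matrix{T}, y::AbstractVector{<:Real}) where {T}
    size(X, 1) == length(y) ||
        throw(DimensionMismatch("X and y must have the same number of rows"))
    index = Dict{Vector{T},Int}()  # row contents => position in the pooled output
    rows = Vector{Vector{T}}()
    neg = Float64[]
    pos = Float64[]
    for i in 1:size(X, 1)
        row = X[i, :]              # allocates a copy per row: simple, not optimal
        k = get!(index, row) do
            # First time this row is seen: register it with zero counts.
            push!(rows, row); push!(neg, 0.0); push!(pos, 0.0)
            length(rows)
        end
        if y[i] == 1
            pos[k] += 1
        else
            neg[k] += 1
        end
    end
    Xu = permutedims(reduce(hcat, rows))  # stack the unique rows back into a matrix
    return Xu, hcat(neg, pos)
end

X = [1.0 2.0; 3.0 4.0; 1.0 2.0; 1.0 2.0]
y = [1, 0, 0, 1]
Xu, counts = pool_duplicate_rows(X, y)
```

Unique rows are kept in order of first appearance, so the pooled design matrix has one row per distinct covariate pattern and the counts carry the information from the collapsed observations.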

simonster, Jan 16 '14 18:01

I'm going to come back to this tomorrow. I've had some problems getting the solver to converge recently and need to delve deeper.

johnmyleswhite, Jan 17 '14 05:01