glum GLM fit without penalty

It would be very nice to also fit completely unpenalized GLMs (as in base R glm). For categorical features, one then needs to drop one level per feature, called the reference or base level.

from plotnine.data import diamonds
from glum import GeneralizedLinearRegressor


# Use the 4 c's as features
# Note that cut, color, clarity have dtype category
X = diamonds.loc[:,["carat", "cut", "color", "clarity"]]

# Targets
y = diamonds["price"]

glm = GeneralizedLinearRegressor(alpha=0, family="gamma", link="log")
glm.fit(X, y)

gives

LinAlgError: Matrix is singular.

The error is correct as no categorical level was dropped.
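To illustrate why the singularity arises, here is a minimal sketch on a tiny hypothetical data set (the frame `df` and the helper `design_matrix` are made up for illustration and are not part of glum). It assumes an intercept column, as glum fits one by default: with all dummy columns kept, each categorical's dummies sum to the intercept, so the design matrix loses rank.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two categorical features.
df = pd.DataFrame({
    "cut": pd.Categorical(["Fair", "Good", "Ideal", "Good", "Ideal", "Fair"]),
    "color": pd.Categorical(["D", "E", "D", "E", "E", "D"]),
})


def design_matrix(frame, drop_first):
    """Intercept column plus dummy columns for every categorical feature."""
    dummies = pd.get_dummies(frame, drop_first=drop_first, dtype=float)
    intercept = np.ones((len(frame), 1))
    return np.hstack([intercept, dummies.to_numpy()])


full = design_matrix(df, drop_first=False)     # all levels kept
reduced = design_matrix(df, drop_first=True)   # first level of each feature dropped

full_rank = np.linalg.matrix_rank(full)
reduced_rank = np.linalg.matrix_rank(reduced)

print(full.shape[1], full_rank)        # more columns than rank: singular
print(reduced.shape[1], reduced_rank)  # columns equal rank: full column rank
```

Dropping one level per categorical removes exactly the redundant columns, which is what an unpenalized fit needs.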

Dec 20 '21 19:12 lorentzenchr

xref https://github.com/Quantco/tabmat/issues/75

Dec 20 '21 19:12 jtilly

In applications where we need to drop a base level, we typically one-hot encode our categoricals before using glum. With few levels, that's also faster than using categorical types. As you can see in the tabmat issue, we have discussed this before, but we were a bit unsure what a good API design for this would be. The easiest solution would be to drop the first category per variable. Do you think that would be good enough?

Dec 20 '21 20:12 jtilly

AFAIK, R also drops the first level, so this seems like a good default, though only in the case of alpha=0. Anyone who still needs more control could go for sklearn.preprocessing.OneHotEncoder(drop=...) or for a formula solution.
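As a rough sketch of the OneHotEncoder route, assuming scikit-learn is installed and using a small made-up array `X_cat` in place of the real categorical columns: `drop="first"` removes the first (alphabetically sorted) level of each feature, and the fitted attributes `categories_` and `drop_idx_` show which levels became the base levels.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical columns (cut, color).
X_cat = np.array(
    [["Fair", "D"], ["Good", "E"], ["Ideal", "D"], ["Good", "E"]],
    dtype=object,
)

# Drop the first level of every feature, similar to R's default contrasts.
encoder = OneHotEncoder(drop="first")
X_encoded = encoder.fit_transform(X_cat).toarray()

# Recover the base (dropped) level of each feature.
dropped = [cats[idx] for cats, idx in zip(encoder.categories_, encoder.drop_idx_)]

print(X_encoded.shape)
print(dropped)
```

For a specific base level per feature, `drop` also accepts an array with one category per feature, e.g. `OneHotEncoder(drop=["Ideal", "E"])`.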

Dec 20 '21 22:12 lorentzenchr

Now that tabmat has support for dropping the first column of a CategoricalMatrix, are there any plans to leverage this to allow direct fitting of unpenalised GLMs (i.e. without creating the design matrix prior to input)?

Mar 08 '22 08:03 peterlee18

Glum now offers a drop_first option (#571).

Mar 15 '23 08:03 lbittarello

That's great. Thank you for all your work!

Mar 17 '23 06:03 lorentzenchr
