glum
GLM fit without penalty
It would be very nice to also fit completely unpenalized GLMs (as in base R glm). For categorical features, one then needs to drop one level, called the reference or base level.
from plotnine.data import diamonds
from glum import GeneralizedLinearRegressor
# Use the 4 C's as features
# Note that cut, color, clarity have dtype category
X = diamonds.loc[:,["carat", "cut", "color", "clarity"]]
# Targets
y = diamonds["price"]
glm = GeneralizedLinearRegressor(alpha=0, family="gamma", link="log")
glm.fit(X, y)
gives
LinAlgError: Matrix is singular.
The error is correct as no categorical level was dropped.
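To illustrate why the error is expected, here is a minimal sketch using only NumPy. It assumes a model with an intercept and a single, hypothetical three-level categorical feature (the toy data and variable names are not from the issue). With all levels encoded, the dummy columns sum to the intercept column, so the design matrix is rank deficient; dropping one level removes that dependence.

```python
import numpy as np

# Hypothetical toy data: one categorical feature with three levels (codes 0, 1, 2).
levels = np.array([0, 1, 2, 0, 1, 2])
one_hot = np.eye(3)[levels]            # full dummy encoding, one column per level
intercept = np.ones((len(levels), 1))

# Intercept plus all dummies: the dummy columns add up to the intercept column.
X_full = np.hstack([intercept, one_hot])
# Dropping the first level removes that exact linear dependence.
X_dropped = np.hstack([intercept, one_hot[:, 1:]])

rank_full = np.linalg.matrix_rank(X_full)        # lower than the number of columns
rank_dropped = np.linalg.matrix_rank(X_dropped)  # equal to the number of columns
print(X_full.shape[1], rank_full)
print(X_dropped.shape[1], rank_dropped)
```

Without a penalty there is nothing to regularize the rank-deficient system, which is consistent with the singular-matrix error above.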
xref https://github.com/Quantco/tabmat/issues/75
In applications where we need to drop a base level, we typically one-hot encode our categoricals before using glum. With few levels, that's also faster than using categorical types. As you can see in the tabmat issue, we have discussed this before, but we were a bit unsure what a good API design for this would be. The easiest solution would be to drop the first category per variable. Do you think that would be good enough?
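One way to do such a one-hot encoding up front is pandas' get_dummies with drop_first=True. This is only a sketch on a small, made-up frame (the values and the explicit category order are assumptions, not the diamonds data), and it is not necessarily the exact preprocessing used by the maintainers.

```python
import pandas as pd

# Hypothetical mini data set with one numeric and one categorical column.
df = pd.DataFrame(
    {
        "carat": [0.23, 0.31, 0.70, 1.01],
        "cut": pd.Categorical(
            ["Ideal", "Good", "Ideal", "Fair"],
            categories=["Fair", "Good", "Ideal"],
        ),
    }
)

# drop_first=True omits the first category ("Fair"), which becomes the base level.
X_encoded = pd.get_dummies(df, columns=["cut"], drop_first=True, dtype=float)
print(X_encoded.columns.tolist())
```

The resulting purely numeric frame can then be passed to an estimator in place of the categorical columns.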
AFAIK, R also drops the first level, so this seems like a good default, though of course only in the case of alpha=0.
If someone still needs more control, she/he could go for sklearn.preprocessing.OneHotEncoder(drop=...) or for a formula solution.
Now that tabmat supports dropping the first column of a CategoricalMatrix, are there any plans to leverage this to allow direct fitting of unpenalised GLMs (i.e. without creating the design matrix prior to input)?
Glum now offers a drop_first option (#571).
That's great. Thank you for all your work!