GLM with log link doesn't always handle zero-heavy training data
I'm trying to fit Tweedie- and Poisson-based GLMs to data with many zero values.
When fitting a GLM with a log link, the glum code tries to guess the intercept for the target by averaging the y-values in the training set and taking the log of the sample mean. However, if every y-value in the training set is 0, which is easy to hit with zero-heavy data, the sample mean is 0. In that case the log of the sample mean is undefined, and glum raises an error.
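A minimal illustration of the failure mode (my own sketch, not glum's actual internals; I'm only assuming the intercept guess boils down to the log of the sample mean, as described above):

```python
import numpy as np

y = np.zeros(10)  # an all-zero target, as in a very zero-heavy fold

# log of the sample mean, used as the intercept guess under a log link
print(np.log(y.mean()))  # -inf, with a RuntimeWarning: log(0) is undefined
```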
This error can occur even when the dependent variable y isn't all zeros on the full training set. For example, in a cross-validation loop, y may happen to be all zeros on the subset of training data used in a particular iteration. With the default k-fold scheme and K=5, each training subset covers 80 percent of the observations, so the error can occur whenever more than 80 percent of the training observations have y=0; a small demonstration follows.
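To make this concrete, here is a sketch of how a vanilla (unshuffled) k-fold split can produce an all-zero training subset when the nonzero targets happen to be concentrated in one block:

```python
import numpy as np
from sklearn.model_selection import KFold

y = np.zeros(100)
y[:5] = 1  # all nonzero targets sit in the first contiguous block

# With the default shuffle=False, folds are contiguous blocks, so the
# first split holds out all of the nonzeros and trains on zeros only.
for train_idx, _ in KFold(n_splits=5).split(y):
    print(y[train_idx].sum())  # prints 0.0 for the first split
```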
Although I'm not an expert on GLMs, it seems to me that there should be a better way to handle these cases, since in principle a GLM can be fit to data with many zeros in it. Could a feature be added to the code to handle the cases where the intercept cannot be estimated because there are too many zeros?
Another thought that occurred to me is that in cases where more than 80% of the observations are 0 for the dependent variable, Tweedie or Poisson GLMs might not be the ideal model (for instance, it may be better to try something like zero-inflated Poisson regression). Is it possible to add support for zero-inflated models to glum in the future?
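For reference, the zero-inflated Poisson mixes a point mass at zero with an ordinary Poisson, which is what lets it absorb the excess zeros:

$$P(Y=0) = \pi + (1-\pi)e^{-\lambda}, \qquad P(Y=k) = (1-\pi)\frac{\lambda^k e^{-\lambda}}{k!} \quad (k \geq 1),$$

so the share of structural zeros $\pi$ is modeled separately from the Poisson rate $\lambda$.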
Hi @thobanster! Thanks for raising an issue.
I would like to split your question into two parts: 1) estimating a GLM with zero-heavy data, and 2) estimating a GLM on a subset that contains only zeros. For both, if my answer below doesn't solve your issue, a minimal reproducible example would be very helpful.
For part 1), glum should be able to handle this case. Even with a single nonzero observation, the mean is not zero, and you should be able to run the estimation. For instance:
```python
import numpy as np

from glum import GeneralizedLinearRegressor

y = np.zeros(int(1e8))
X = np.random.normal(size=(int(1e8), 2))

mdl = GeneralizedLinearRegressor(family='poisson')
try:
    mdl.fit(X, y)  # raises an error: y is all zeros, so log(mean(y)) is undefined
except Exception as exc:
    print(exc)

y[0] = 1  # now our dependent variable is very zero-heavy, but not *only* zeros
mdl = GeneralizedLinearRegressor(family='poisson')
mdl.fit(X, y)  # this works
```
For part 2), I think you should look for an alternative strategy. If your target contains only a single value (here it's all zeros), I'm not sure how you would interpret the results. If you have a specific example in mind where this makes sense, let us know.
Now, regarding your specific example of using cross-validation, you should be able to fix your issue with a stratified k-fold cross-validation approach, which ensures you don't end up by chance with a fold whose training targets are all zeros; a sketch follows.
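For example, using sklearn's StratifiedKFold and stratifying on a zero/nonzero indicator (the exact setup is up to you):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.random.normal(size=(100, 2))
y = np.zeros(100)
y[:5] = 1.0  # 5 nonzero targets out of 100

# Stratify on a binary zero/nonzero indicator so that every training
# split keeps some nonzero targets.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, _ in skf.split(X, y > 0):
    print((y[train_idx] > 0).sum())  # 4 nonzeros in every training split
```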
That being said, the error message should be more explicit.
> Another thought that occurred to me is that in cases where more than 80% of the observations are 0 for the dependent variable, Tweedie or Poisson GLMs might not be the ideal model (for instance, it may be better to try something like zero-inflated Poisson regression). Is it possible to add support for zero-inflated models to glum in the future?
It's possible. It's not on our list of short-term priorities, but if someone is willing to contribute a PR, it could be added relatively soon.
Hi, sorry for the late response to this! I think the stratified k-fold makes sense, although it's still somewhat limited: at least in the sklearn implementation, it can't handle the most extreme cases of sparsity, where the number of nonzero values in the training set is smaller than the number of CV splits. Such sparse datasets can occur when the underlying rate generating the counts is very small. (See the snippet below.)
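To illustrate the limit I mean, here's what happens (at least in current sklearn) when the number of nonzeros drops below n_splits:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

y = np.zeros(100)
y[:3] = 1.0  # only 3 nonzero targets, fewer than the 5 requested splits

try:
    list(StratifiedKFold(n_splits=5).split(np.zeros((100, 1)), y > 0))
except ValueError as exc:
    print(exc)  # sklearn refuses: n_splits cannot exceed the members of each class
```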
For the other part of this issue, namely zero-inflated models, I do not have an implementation of that yet, but I do have an implementation of a custom negative binomial distribution that I wrote for glum. Since negative binomial GLMs can handle overdispersed data, they sometimes do a good job of fitting data that is zero-inflated. If you think that having this negative binomial GLM in glum would be useful to other users, I could open a PR.
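For context, the extra flexibility comes from the variance function; under the common NB2 parameterization,

$$\operatorname{Var}(Y) = \mu + \alpha\mu^2,$$

whereas the Poisson forces $\operatorname{Var}(Y) = \mu$. As $\alpha \to 0$ the negative binomial reduces to the Poisson, so the extra parameter lets the model soak up the overdispersion that a large share of zeros typically induces.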
@thobanster do you still have an implementation of negative binomial and could open a PR for this?