
adding BetaBernoulli distribution with LogScore

Open · guyko81 opened this pull request 4 years ago · 11 comments

BetaBernoulli distribution implemented with LogScore, Fisher Information included.
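
Roughly, the implementation follows ngboost's distribution/score pattern. Here is a minimal sketch of that shape (import paths and class names are illustrative, and d_score/metric, where the Fisher Information in this PR lives, are omitted):

```python
# A minimal sketch, assuming ngboost's usual distribution/score pattern;
# import paths and class names are illustrative, not the PR's actual diff.
import numpy as np
from scipy.special import betaln

from ngboost.distns.distn import RegressionDistn
from ngboost.scores import LogScore


class BetaBernoulliLogScore(LogScore):
    def score(self, Y):
        # negative log-likelihood: -[log B(Y + alpha, 1 - Y + beta) - log B(alpha, beta)]
        return -(betaln(Y + self.alpha, 1 - Y + self.beta)
                 - betaln(self.alpha, self.beta))


class BetaBernoulli(RegressionDistn):
    n_params = 2
    scores = [BetaBernoulliLogScore]

    def __init__(self, params):
        super().__init__(params)
        # parameters are fit on the log scale so alpha and beta stay positive
        self.logalpha, self.logbeta = params[0], params[1]
        self.alpha, self.beta = np.exp(params[0]), np.exp(params[1])
```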

guyko81 avatar Jun 14 '20 17:06 guyko81

@tonyduan are you planning on doing a re-review here? I see some relevant changes have been made.

alejandroschuler avatar Jul 13 '20 17:07 alejandroschuler

> @tonyduan are you planning on doing a re-review here? I see some relevant changes have been made.

There's an identifiability issue so I'm recommending against implementing the BetaBernoulli. We should implement the BetaBinomial instead. Pasting my earlier comment below:

  1. The FI is equivalent to the negative Hessian, so lines 85-89 below should have a negative sign.

  2. After fixing the above, I can confirm the FI derived from the expected negative Hessian is always equal to the FI derived from the variance of the score (as we would expect).

  3. However, the resulting FI is always singular. This is mathematically correct. It didn't occur to me earlier, but there's actually an identifiability issue here when n=1. Specifically, for any value of (alpha, beta) we can scale both by some scalar k to (k alpha, k beta) and it'd result in the same Bernoulli distribution p(x) for x in {0, 1}. When n>1 this is not an issue, since the second parameter controls the dispersion of the resulting Binomial distribution and the model is identifiable. So the Beta-Binomial distribution only makes sense when n>1 (see the numerical sketch after this list).

  4. Moving forward we should still implement the Beta-Binomial distribution for n>1 (an implementation is provided by [GAMLSS], for example). It most likely makes more sense to follow their implementation using a location-scale parameterization instead of an alpha-beta parameterization. Taking the Hessian seems like it'd be annoying so I'd recommend following your variance derivation.
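
A quick numerical check of points 3 and 4 (a scipy sketch; betabinom_loglik is a helper defined just for this illustration):

```python
# With n = 1 the Beta-Binomial likelihood is invariant to scaling
# (alpha, beta) -> (k alpha, k beta); with n > 1 it is not.
import numpy as np
from scipy.special import betaln, comb


def betabinom_loglik(y, n, a, b):
    # log P(Y = y) under Beta-Binomial(n, a, b)
    return np.log(comb(n, y)) + betaln(y + a, n - y + b) - betaln(a, b)


for k in [1.0, 2.0, 10.0]:
    print(betabinom_loglik(1, 1, 2 * k, 3 * k),  # n = 1: identical for every k
          betabinom_loglik(3, 5, 2 * k, 3 * k))  # n = 5: changes with k
```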

tonyduan avatar Jul 13 '20 18:07 tonyduan

@tonyduan got it, thanks for that. @guyko81 could you reimplement along those lines?

alejandroschuler avatar Jul 13 '20 18:07 alejandroschuler

I didn't express it clearly, but I proposed this branch as an alternative to logistic regression or any binary classifier, so creating the Beta-Binomial would add no benefit for that specific problem. I'm happy to implement the Beta-Binomial too, but I would still like to keep the Beta-Bernoulli as well, probably in this form.

Can I ask whether it's possible that the diagonal version of the FI matrix still helps find the true Beta-Bernoulli distribution (strictly from a mathematical point of view)?

I mean, when I ran the model it gave reasonable results. I tested it on Kaggle and the model was at an acceptable level (around the random forest result), but obviously with the extra information of the uncertainty of the predicted probabilities. Or am I under a false sense of confidence based only on the expected value, while the uncertainty was just a random value?

guyko81 avatar Jul 16 '20 11:07 guyko81

> I mean, when I ran the model it gave reasonable results. I tested it on Kaggle and the model was at an acceptable level (around the random forest result), but obviously with the extra information of the uncertainty of the predicted probabilities. Or am I under a false sense of confidence based only on the expected value, while the uncertainty was just a random value?

Unfortunately I think it's the latter :(

You can tell this just by doing some math with the probability mass function of the Beta-Bernoulli, which is

P(Y=y) = B(y + a, 1 - y + b) / B(a, b)

where a and b are the two parameters and B is the beta function. Plugging in the two cases y=0 and y=1 and using the properties of the beta and gamma functions, we arrive at

P(Y=0) = b/(a+b)
P(Y=1) = a/(a+b)

Call these values p* = a/(a+b) and 1 - p* = b/(a+b). From the perspective of model fitting, the likelihood or scoring rule therefore only depends on a single parameter p* that is a function of a and b; i.e., the exact values of a and b only matter insofar as they produce a different value of p*. Note also that p* is the mean of p ~ Beta(a, b).

For any given p*, however, there are an infinite number of values for (a,b) that would produce it. As long as a = bp*/(1-p*), you're fine. Since a and b are what determine the uncertainty in your probability p ~ Beta(a,b), that means that you can have infinitely many distributions, all with different uncertainties on p, all of which are equally good in terms of log likelihood or any other scoring rule. So, yes, you will be able to find values of a,b that give good prediction (i.e. p* is good), but the uncertainty estimates for p are meaningless because there is a model with a different uncertainty (any uncertainty, actually) which is equally good.
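
To see this concretely (a small scipy sketch; the (a, b) pairs are arbitrary examples):

```python
# Two parameter pairs with the same p* = a/(a+b) = 0.25: identical Bernoulli
# likelihoods, but very different uncertainty about p ~ Beta(a, b).
from scipy.stats import beta

for a, b in [(1, 3), (10, 30)]:
    p_star = a / (a + b)             # 0.25 in both cases
    print(p_star, beta(a, b).std())  # same mean for p, very different spread
```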

alejandroschuler avatar Jul 16 '20 16:07 alejandroschuler

Just a suggestion but I think it would be nice as part of this PR to add to the distns test here for the new dist: https://github.com/stanfordmlgroup/ngboost/blob/master/ngboost/tests/test_distns.py
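
Something like this, say (a rough sketch only; test_distns.py parametrizes over distributions and scores, so the exact shape may differ, and the import path for the PR's new class is assumed):

```python
# Hedged sketch of a smoke test; exact fixtures in test_distns.py may differ.
import numpy as np

from ngboost import NGBRegressor
from ngboost.distns import BetaBernoulli  # the class this PR adds (path assumed)


def test_beta_bernoulli_smoke():
    rng = np.random.default_rng(0)
    X = rng.random((100, 3))
    Y = rng.integers(0, 2, size=100).astype(float)  # binary outcomes
    ngb = NGBRegressor(Dist=BetaBernoulli, n_estimators=20)
    ngb.fit(X, Y)
    assert ngb.predict(X).shape == (100,)
```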

ryan-wolbeck avatar Jul 21 '20 13:07 ryan-wolbeck

@guyko81 should we close this PR in light of the issues with this model or are you interested in extending it to the n>1 case that @tonyduan described?

> Moving forward we should still implement the Beta-Binomial distribution for n>1 (an implementation is provided by [GAMLSS], for example). It most likely makes more sense to follow their implementation using a location-scale parameterization instead of an alpha-beta parameterization. Taking the Hessian seems like it'd be annoying so I'd recommend following your variance derivation.

alejandroschuler avatar Sep 10 '20 17:09 alejandroschuler

@alejandroschuler I've had many things on my plate, but I will find the time to do it in the upcoming days. Please keep it open for a few days and I'll come back with an update. Thanks

guyko81 avatar Sep 11 '20 10:09 guyko81

@guyko81 just checking in on this PR

ryan-wolbeck avatar Dec 17 '20 18:12 ryan-wolbeck

I'm struggling to figure out how to handle n (the number of trials) in the metric/d_score. The number of trials in BetaBinomial should come from the dataset, so it's not a distribution parameter, but the equations expect it in the FI. Can someone help me out?

This is the metric function:

    def metric(self):

It has no n among its parameters.

And this is FI[:, 0, 0], where n appears throughout:

    import numpy as np
    from scipy.special import betaln, gammaln, polygamma

    # logalpha and logbeta are the distribution's internal parameters; n is
    # the number of trials, which metric() has no access to.
    a, b = np.exp(logalpha), np.exp(logbeta)

    def pmf(k):
        # Beta-Binomial(n, a, b) probability mass at k
        log_binom = gammaln(n + 1) - gammaln(k + 1) - gammaln(n - k + 1)
        return np.exp(log_binom + betaln(k + a, n - k + b) - betaln(a, b))

    # FI[:, 0, 0] = -E[d^2 log-likelihood / d(logalpha)^2]; n appears in the
    # pmf, in the summation range, and inside the polygamma arguments.
    FI_00 = sum(
        pmf(k) * (a * (polygamma(0, a) + polygamma(0, n + a + b)
                       - polygamma(0, k + a) - polygamma(0, a + b))
                  + a**2 * (polygamma(1, a) + polygamma(1, n + a + b)
                            - polygamma(1, k + a) - polygamma(1, a + b)))
        for k in range(n + 1)
    )

guyko81 avatar Jan 12 '21 13:01 guyko81

n is not a continuous parameter, so it can't be optimized via gradient descent (without some kind of clever trickery I'm not aware of) and thus cannot be a parameter, as you say. It also can't be less than the maximum value observed in the data, so there are large regions of infinite score. All said, we shouldn't worry about optimizing over n.

That means n has to be passed in as a fixed constant when the distribution is initialized. The way I'd deal with that is to write a factory function that generates the Beta-Binomial class for a given number of trials, i.e.

    def beta_binomial(n_trials: int):
        class BetaBinomial:
            # the factory argument needs a name other than n: writing `n = n`
            # in a class body raises NameError, because the assignment makes n
            # class-local and hides the enclosing function's variable
            n = n_trials
            ...

        return BetaBinomial

This is the way I've implemented the categorical distribution, so you can look there for inspiration. If you need the value of n in the score implementations, you can simply use self.n and it will reference the class attribute n that you set in the factory function.
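
Hypothetical usage of the factory above, mirroring how k_categorical is used for the Categorical distribution:

```python
# Sketch of plugging the factory-built class into NGBoost; beta_binomial is
# the (yet to be written) factory from the snippet above.
from ngboost import NGBRegressor

BetaBinomial5 = beta_binomial(5)        # class fixed at n = 5 trials
ngb = NGBRegressor(Dist=BetaBinomial5)  # scores can read the trial count via self.n
```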

alejandroschuler avatar Jan 26 '21 17:01 alejandroschuler