
penalizer_coeff: revisiting the theory and issues in practice

extrospective opened this issue 4 years ago · 0 comments

My understanding of penalizers is that they are intended to drive models toward lower complexity, with the added benefit of reducing overfitting. In that theoretical context I want to point out a few items:

a. The parameters being penalized are highly interdependent: in some cases, when one parameter goes up, another goes down. It is therefore not clear that simply summing the squares of the parameters and adding that sum to the negative log-likelihood produces the intended behavior.

b. In practice, I have charted penalizer values against model output rather than letting an optimizer run blindly through the optimization. To do this, I created metrics that let me judge whether the model explains the holdout period better (MAE, RMSE, and Spearman rank correlation, for example). In some cases I see non-monotonicity, so there may be local minima.
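As an aside, those holdout metrics take only a few lines of numpy/scipy; the `actual` and `predicted` arrays below are illustrative placeholders for per-customer holdout spend, not my data:

```python
import numpy as np
from scipy.stats import spearmanr

# Illustrative stand-ins for actual vs. predicted holdout spend per customer.
actual = np.array([30.0, 12.5, 48.0, 20.0, 35.5])
predicted = np.array([28.0, 15.0, 40.0, 22.0, 30.0])

mae = np.mean(np.abs(actual - predicted))           # mean absolute error
rmse = np.sqrt(np.mean((actual - predicted) ** 2))  # root mean squared error
rho, _ = spearmanr(actual, predicted)               # Spearman rank correlation

print(f"MAE={mae:.2f}, RMSE={rmse:.2f}, Spearman={rho:.2f}")
```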

However, just as importantly, I can see what seem to be non-sensible results.

Here is a snippet that exhibits one such example:

from lifetimes.datasets import load_cdnow_summary_data_with_monetary_value
from lifetimes import GammaGammaFitter


summary_with_money_value = load_cdnow_summary_data_with_monetary_value()
returning_customers_summary = summary_with_money_value[summary_with_money_value['frequency']>0]

for penalizer_coef in [0, 0.1, 0.5]:
    ggf = GammaGammaFitter(penalizer_coef=penalizer_coef)
    ggf.fit(returning_customers_summary['frequency'], returning_customers_summary['monetary_value'])
    print(f'{penalizer_coef}: {ggf.params_}')

This is evaluating the penalizer coefficient for the GammaGammaFitter.

The output shows the p, q, and v parameters from the model. The q parameter falls below 1 when the penalizer is 0.1 or 0.5.

My output:

    0.0: p=6.25, q=3.74, v=15.45
    0.1: p=1.28, q=0.34, v=1.15
    0.5: p=0.54, q=0.20, v=0.43

Why is this a problem?

Review the code which estimates the conditional expected spend:

        individual_weight = p * frequency / (p * frequency + q - 1)
        population_mean = v * p / (q - 1)

If q < 1, the denominator q - 1 is negative, so the population mean turns negative as long as p and v remain positive, which they do.
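To make this concrete, here is the arithmetic with the penalizer_coef=0.5 parameters quoted above (the frequency value is a hypothetical customer, chosen only for illustration):

```python
# Fitted parameters from the penalizer_coef=0.5 run quoted above.
p, q, v = 0.54, 0.20, 0.43

# Denominator q - 1 is negative when q < 1, so the population mean flips sign.
population_mean = v * p / (q - 1)

# A hypothetical customer with two repeat purchases: the individual weight
# exceeds 1, so the population term receives a *negative* weight.
frequency = 2
individual_weight = p * frequency / (p * frequency + q - 1)

print(f"population_mean={population_mean:.3f}, individual_weight={individual_weight:.3f}")
```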

The model seems to have degenerated into a non-sensible condition, driven by the penalizer.

In my own practice, I then charted MAE and RMSE per penalizer value, and additionally marked the values where q < 1 as "invalid".

For my own data set, you will notice small red x's behind the rightmost three points, which mark the invalid models. You will also see the non-monotonicity, which could let an optimization algorithm get stuck in a valley of invalid candidate models. (FYI: log(penalty) is used on the x axis to accommodate the extremely small values, and a tiny shift of about 0.0000001 is added to the penalty before taking the log, so that the penalty of 0 can be plotted consistently alongside the other points.)

[figure: MAE and RMSE vs. log(penalty), with red x's marking the invalid (q < 1) models]
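The marking logic itself is simple. The sketch below reuses the (p, q, v) triples quoted earlier rather than re-fitting, and reproduces the log-with-tiny-shift x axis from the chart:

```python
import numpy as np

# Fitted (p, q, v) triples per penalizer value, taken from the output quoted
# above; in practice these would come from GammaGammaFitter.fit.
fits = {
    0.0: (6.25, 3.74, 15.45),
    0.1: (1.28, 0.34, 1.15),
    0.5: (0.54, 0.20, 0.43),
}

eps = 1e-7  # tiny shift so the penalty of 0 can still be plotted on a log axis
for coef, (p, q, v) in fits.items():
    valid = q > 1                # q <= 1 drives the population mean negative
    x = np.log(coef + eps)       # x-axis position used in the chart
    print(f"log-penalty={x:+.2f}  q={q:.2f}  {'ok' if valid else 'INVALID'}")
```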

Returning to the Gamma-Gamma fitter in particular, my intuition about the penalizer theory is that we would want to penalize models which amplify differences; for example, models which heighten the individual weight relative to the population weight. This suggests that the penalty should not land on the three parameters equally: it might instead be applied to the individual weight, i.e. to p rather than to all values evenly. Penalizing q seems to require a constraint like q > 1. And penalizing v looks strange, because from the two formulae above it seems specifically to depress the population mean.
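To illustrate the alternative (a sketch of the proposal above, not an option lifetimes currently offers): compare the uniform squared-parameter penalty, which sums over all three parameters, with a penalty on p alone, using the unpenalized fit quoted earlier:

```python
# Parameter values from the penalizer_coef=0 fit quoted above.
p, q, v = 6.25, 3.74, 15.45
coef = 0.1

# Current behavior: all parameters are shrunk equally.
uniform_penalty = coef * (p**2 + q**2 + v**2)

# Proposed alternative: shrink only p, the driver of the individual weight
# (a q > 1 constraint would still be needed separately).
p_only_penalty = coef * p**2

print(f"uniform={uniform_penalty:.3f}, p-only={p_only_penalty:.3f}")
```

The p-only penalty leaves q and v free to satisfy the likelihood, so it cannot by itself push q below the q > 1 boundary the way the uniform penalty does in the output above.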

I imagine the same logic should also be applied to the other penalizers in Lifetimes. I have chosen the Gamma-Gamma model for discussion here because the unintended parameter distortion caused by the penalizer is most obvious.

extrospective · Jul 22 '20 15:07