pymc-marketing Add `BetaGeoBetaBinom` Distribution Block

This PR resurrects https://github.com/pymc-labs/pymc-marketing/pull/188 to partially address https://github.com/pymc-labs/pymc-marketing/issues/176.

The Beta-Geometic/Beta Binomial model is a CLV model for discrete, non-contractual use cases - a good example would be sporting events. This PR adds a distribution block to be used for posterior predictive checks and the logp in the full model which will be added in a separate PR.

For now this is a draft PR because I can't get the logp to return the correct values; for reference please see (5) on p4 of the research paper:

https://www.brucehardie.com/papers/020/fader_et_al_mksc_10.pdf

I could use some help with the pytensor functions this requires.

:books: Documentation preview :books:: https://pymc-marketing--431.org.readthedocs.build/en/431/

Nov 11 '23 08:11 ColtAllen

I'll try to check this week @ColtAllen

Nov 11 '23 12:11 ricardoV94

I pushed a version of the logp that works with Scan. I think we can make it work without a Scan, by evaluating on the maximum T-tx range. I just don't know if this results in much wasted computation in general use cases (for instance if there's one datapoint with a very large range, but most other datapoints have smaller ranges)

Nov 14 '23 18:11 ricardoV94

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 92.08%. Comparing base (9aa8bc0) to head (60a1f55).

:exclamation: Current head 60a1f55 differs from pull request most recent head c376695

Please upload reports for the commit c376695 to get more accurate results.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #431      +/-   ##
==========================================
- Coverage   92.25%   92.08%   -0.17%     
==========================================
  Files          24       24              
  Lines        2414     2490      +76     
==========================================
+ Hits         2227     2293      +66     
- Misses        187      197      +10

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

Nov 14 '23 18:11 codecov[bot]

Check out this pull request on

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB

Dec 11 '23 18:12 review-notebook-app[bot]

I added the likelihood expression to the docstrings, along with a dev notebook. Not sure if the LaTeX rendered properly in the docstrings though, because the docs build is failing due to a pytensor issue:

ImportError: cannot import name 'vectorize' from 'pytensor.graph' 
(/home/docs/checkouts/readthedocs.org/user_builds/pymc-marketing/envs/431/lib/python3.10/site-packages/pytensor/graph/__init__.py)

Dec 11 '23 18:12 ColtAllen

I pushed a version of the logp that works with Scan. I think we can make it work without a Scan, by evaluating on the maximum T-tx range. I just don't know if this results in much wasted computation in general use cases (for instance if there's one datapoint with a very large range, but most other datapoints have smaller ranges)

In general I don't see very large values of T being a common use case with this model. Is Scan the most performant approach? If so I'm fine leaving it as-is.

Dec 11 '23 19:12 ColtAllen

I added the likelihood expression to the docstrings, along with a dev notebook. Not sure if the LaTeX rendered properly in the docstrings though, because the docs build is failing due to a pytensor issue:
ImportError: cannot import name 'vectorize' from 'pytensor.graph' 
(/home/docs/checkouts/readthedocs.org/user_builds/pymc-marketing/envs/431/lib/python3.10/site-packages/pytensor/graph/__init__.py)

It's now called vectorize_graph

Dec 16 '23 16:12 ricardoV94

Is Scan the most performant approach? If so I'm fine leaving it as-is.

Scan is the simplest approach

Dec 16 '23 16:12 ricardoV94

Tests are still failing with vectorize_graph. Do the library dependencies require updating?

Dec 16 '23 17:12 ColtAllen

Yes you need to bump the oldest supported pymc version to the most recent

Dec 16 '23 17:12 ricardoV94

In here and in the ci wofkflow file: https://github.com/pymc-labs/pymc-marketing/blob/main/pyproject.toml#L20

Dec 16 '23 17:12 ricardoV94

I had to increase rtol because we're comparing against the data simulation function in lifetimes, and it lacks a random seed. If the sim_data function looks good and I'm indexing properly in this test, this should be good to go.

Feb 24 '24 18:02 ColtAllen

@ColtAllen this looks good! I wanna merge this one and iterate (I can help with that). I have just two requests :D :

Remove commented code

Fix pre-commit (I think Ruff is complaining)

Otherwise LGTM.

First point is done, but Ruff is complaining about a LaTeX expression in the docstring that is too long. I'm not sure if there's any way to shorten it that doesn't prevent the docstring from rendering properly.

May 26 '24 23:05 ColtAllen

First point is done, but Ruff is complaining about a LaTeX expression in the docstring that is too long. I'm not sure if there's any way to shorten it that doesn't prevent the docstring from rendering properly.

Doesn't seem to be possible to turn off and on at the moment based on this ruff issue. Maybe you can define the string outside of the class then use f-string. Could reduce 8 something characters on a line. Could turn off for the whole file if there is a docstring specific option

May 27 '24 00:05 williambdean

I think we can simply do it as in https://github.com/pymc-labs/pymc-marketing/blob/main/pymc_marketing/clv/distributions.py#L407

May 27 '24 08:05 juanitorduz

Thanks @ColtAllen ! Let's now work on the model 🎉 !

May 27 '24 18:05 juanitorduz

pymc-marketing pymc-marketing copied to clipboard

Add `BetaGeoBetaBinom` Distribution Block

Codecov Report

pymc-marketing
pymc-marketing copied to clipboard