lifetimes
Have any "successor libraries" emerged, as Cam suggested?
UPDATE: pymc-marketing will become the new successor to this library.
I know this post is nearly a year old, but I would be happy to collaborate with others on a successor library built in PyMC.
I've recently started working on a CLV project and already foresee the time-based splitting of calibration and holdout data as a considerable limitation. Random and/or stratified sampling to ensure the calibration and holdout data are equally distributed would be my priority (see the sketch below). PyMC's built-in statistical functions would also lend themselves well to this project, and model training could be distributed across GPUs to dramatically reduce training time.
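To sketch what I have in mind, here's a stratified random split using scikit-learn's `train_test_split` and a hypothetical RFM summary file - none of this is lifetimes API:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical RFM summary table with frequency/recency/T/monetary_value columns.
rfm = pd.read_csv("rfm_summary.csv")

# Bin purchase frequency so both splits share its distribution.
freq_bins = pd.qcut(rfm["frequency"], q=4, duplicates="drop")

# Stratified random split instead of a time-based calibration/holdout cutoff.
calibration, holdout = train_test_split(
    rfm, test_size=0.2, stratify=freq_bins, random_state=42
)
```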
I'm still proceeding with lifetimes as-is for the beta release of my CLV project, so I won't have much time to dedicate to a successor library until Mar 2022, but if anyone is interested, please respond to this issue.
@ColtAllen feel free to contact me
@ColtAllen, I am personally more interested in a TensorFlow Probability-based successor, having not worked with Pyro much, but I would be interested in assisting and seeing where there may be overlap.
I have been working, albeit slowly, on building a successor on Dask instead of Pandas. I see the challenge of doing CLV on millions of users and not being able to fit things in memory. The idea of Pyro sounds very compelling. How would you like to organize the project?
@shgidi @gpyga @rodrigorivera Want to plan a Zoom call to discuss this further? I’m in the Denver area, Mountain Standard Time (UTC-7:00). I have a draft prepared of the details I want to discuss, but I’ll provide an overview here and address your comments.
> @ColtAllen, I am personally more interested in a TensorFlow Probability-based successor, having not worked with Pyro much, but I would be interested in assisting and seeing where there may be overlap.
Pyro is to PyTorch what TFProb is to TensorFlow. If this project takes off, then supporting both libraries would be a great direction to go. I personally prefer Pyro because open-source is only as good as its supporting documentation. I started working with TFProb back in 2017, when it was still called Edward, but have since moved away from it because its vague yet verbose documentation - which even has a few broken links - created considerable friction in my projects:
https://www.tensorflow.org/probability/overview
The documentation for Pyro, on the other hand, is among the best I've ever seen for an open-source library:
https://docs.pyro.ai/en/stable/
Both packages are also relatively low-level. Base TF can be cumbersome to work with, whereas PyTorch was expressly written to have a syntax similar to NumPy:
https://pytorch.org/tutorials/beginner/basics/tensorqs_tutorial.html
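For example, here's a trivial side-by-side of the two tensor APIs (toy values only):

```python
import numpy as np
import torch

# The two APIs mirror each other almost operation-for-operation.
a = np.ones((2, 3)) * 2.0
t = torch.ones(2, 3) * 2.0

print(a.sum(axis=1))        # numpy: array([6., 6.])
print(t.sum(dim=1))         # torch: tensor([6., 6.])
print(torch.from_numpy(a))  # zero-copy bridge from numpy to torch
```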
Speaking of numpy:
https://examples.dask.org/array.html
> I have been working, albeit slowly, on building a successor on Dask instead of Pandas. I see the challenge of doing CLV on millions of users and not being able to fit things in memory.
Dask is basically a distributed drop-in replacement for NumPy and would be an excellent alternative for the RFM aggregations (rough sketch below). My current project has over 88 million transactions, so my team had to create a separate RFM feature store just to use lifetimes.
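For illustration, a rough sketch of an out-of-core RFM aggregation with `dask.dataframe` - the file name, column names, and aggregation spec are placeholders of mine, not anything from lifetimes:

```python
import dask.dataframe as dd

# Hypothetical transaction log too large to fit in memory.
transactions = dd.read_parquet("transactions.parquet")

# Lazily aggregate per customer, then compute out-of-core / across workers.
rfm = (
    transactions.groupby("customer_id")
    .agg({"date": ["count", "max"], "amount": "mean"})
    .compute()
)
```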
> The idea of Pyro sounds very compelling. How would you like to organize the project?
In the Zoom call, I want to discuss and reach common agreement on the following areas:
- Problems
- Goals
- Contributing
I've reviewed the GitHub issues for lifetimes in detail, and I'm sure we each have our own lists of problems to bring up, but let's not confuse issues with features we'd like to see added.
I like the OKR approach for setting goals (qualitative Objectives and measurable Key Results), but I'm not married to the methodology by any means. A good Objective would be to make lifetimes the premier open-source library for stochastic RFM and CLV modeling. The number of models supported, reduced training times and convergence-error rates, and growth in GitHub Stars and Watchers are all ways we could measure this.
Lastly, the documentation for lifetimes is quite good, but I want to review the contributor's guide in particular, make any desired changes, and ensure we're all in alignment before going full speed ahead with code development; it will make PRs go much more smoothly in the future.
After these preliminaries are out of the way, we can put a task list together and set up GitHub Project pages for each. Looking forward to working with you all!
@ColtAllen I would also be interested in collaborating on a successor library, and would love to join an upcoming call (if the kickoff you mentioned hasn't happened yet)!
We use `lifetimes` in our CVM toolkit at my current company, but I was looking into how we might get access to a wider variety of methods than it currently implements (I was looking at R libraries like `btydplus` and `CLVTools` for inspiration). Unlike the rest of you, I have no strong opinion on a backend thus far, although I have slightly more exposure to Dask than to the other alternatives.
> @shgidi @gpyga @rodrigorivera Want to plan a Zoom call to discuss this further? I’m in the Denver area, Mountain Standard Time (UTC-7:00). I have a draft prepared of the details I want to discuss, but I’ll provide an overview here and address your comments.
Absolutely. I am in Central European Time. Should we aim to have a call in the second or third week of March?
@rodrigorivera Awesome! How about either March 13th or 20th for the Zoom meeting? Due to time zone differences, I see this happening around noontime for those in the Americas, and in the evening for those in Europe.
@deepyaman Hope you can join! I've been looking at the `btydplus` and `CLVTools` R libraries as well, and am even considering `rpy2` (a Python API for R) as a band-aid for the MLE convergence issues I've been encountering in `lifetimes` so far.
March 13 works for me personally!
@ColtAllen I’ve used `rpy2` in the past to use an epidemiological modeling package that - at least at the time - had no reasonable Python equivalent. My intuition is to steer clear of it for a successor to `lifetimes`, since requiring an R runtime for a Python package ends up being very inconvenient/limiting from a production-deployment perspective (suddenly all the Docker images need to have R installed, etc.).
@deepyaman Great! I'll let @rodrigorivera pick the time since this will be happening at the very end of his day, and I'll post the Zoom link here for anyone to join.
Also, I have little interest in integrating `rpy2` into `lifetimes`; sorry for not clarifying that earlier. My director has R experience and floated the idea for our internal project deployment, but that's an excellent point you make about the added Dockerfile complexity. I'll be sure to bring it up.
If I had to pick another language to incorporate into `lifetimes`, it would be Stan, which `prophet` uses under the hood for MCMC inference of the hyperparameters:
https://github.com/facebook/prophet/blob/main/python/stan/unix/prophet.stan
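For what it's worth, driving a Stan program from Python is straightforward with `cmdstanpy`. A sketch, assuming a hypothetical `bgnbd.stan` file whose data block declares `N`, `x`, `t_x`, and `T`:

```python
from cmdstanpy import CmdStanModel

# Compile the (hypothetical) Stan program and run NUTS on toy RFM data.
model = CmdStanModel(stan_file="bgnbd.stan")
fit = model.sample(
    data={"N": 3, "x": [2, 0, 5], "t_x": [30, 0, 50], "T": [60, 40, 55]}
)
print(fit.summary())
```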
@deepyaman @rodrigorivera @gpyga @shgidi I’m pushing back this Zoom call because I’ve sent collaboration invites to others and want to give them the opportunity to join as well. If I don’t hear from any of them by St. Patrick’s Day, we can go forward with meeting on 20-Mar or any other Sunday you prefer.
I’ve been reviewing the choices of backend for a successor library, and now believe `pymc3` and/or `Stan` are the best options. I’ve found code implementations of the BG/NBD and Gamma-Gamma models in `pymc3` and `Stan`, respectively, and have sent collaboration invites to the creators.
`pymc3` has the cleanest, most Pythonic syntax of any statistical library I’ve worked with, but I stopped using it several years ago because it still used the deprecated Theano tensor library as a backend. However, `aesara` - the successor backend they’ve developed - seems quite mature now, and both `aesara` and `pymc3` have huge developer communities to reach out to for support. @CamDavidsonPilon himself has even written an eBook about `pymc3`; I do hope he’s able to join the Zoom call and/or assist in a technical advisory capacity.
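As a taste of that syntax, here's a minimal toy pymc3 model - a Beta-Bernoulli repeat-purchase model of my own, not one of the lifetimes models:

```python
import numpy as np
import pymc3 as pm

# Toy data: whether each of ten customers made a repeat purchase.
repeat = np.array([0, 1, 0, 0, 1, 1, 0, 0, 0, 1])

with pm.Model() as model:
    # Beta prior on the repeat-purchase probability.
    p = pm.Beta("p", alpha=1.0, beta=1.0)
    pm.Bernoulli("repeat", p=p, observed=repeat)
    trace = pm.sample(1000, tune=1000, return_inferencedata=True)
```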
Lastly, I've forked this repo and have invited you all to be collaborators:
https://github.com/ColtAllen/lifetimes
I haven’t done much yet aside from update the README, but I’ll be adding some new research paper links and making other minor documentation changes here shortly.
I appreciate the invitation to join the call and provide advice, but I don't think I would add much! I would like to express my excitement about a successor library being built with probabilistic programming tools - that was a future vision of mine for these RFM techniques. Best of luck, folks!
Zoom call is scheduled for Sunday, 27-Mar at 10 AM Mountain Daylight Time (UTC-6:00)
I've been receiving messages from other interested parties on LinkedIn, so I'm delaying the Zoom call by one more week to give others the chance to discover this discussion and join.
I've already started working on an MCMC implementation of the Beta-Geo model. MCMC has challenges of its own, but according to the paper below it has far fewer convergence issues than the current MLE approach, which will solve a lot of the problems people have with this library:
Worth the effort? Comparison of different MCMC algorithms for estimating the Pareto/NBD model
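To give a rough idea of the direction, here's a sketch of the BG/NBD log-likelihood - following the formulation in lifetimes' `BetaGeoFitter` - expressed as a `pm.Potential` in pymc3. The priors and toy data are assumptions of mine:

```python
import numpy as np
import pymc3 as pm
import theano.tensor as tt

# Toy RFM arrays: x = repeat purchases, t_x = recency, T = customer age.
x = np.array([2.0, 0.0, 5.0, 1.0])
t_x = np.array([30.0, 0.0, 50.0, 10.0])
T = np.array([60.0, 40.0, 55.0, 28.0])

with pm.Model() as bgnbd:
    # Weakly informative priors on the BG/NBD hyperparameters (an assumption).
    r = pm.HalfNormal("r", sigma=10.0)
    alpha = pm.HalfNormal("alpha", sigma=10.0)
    a = pm.HalfNormal("a", sigma=10.0)
    b = pm.HalfNormal("b", sigma=10.0)

    # Shared terms of the BG/NBD log-likelihood.
    ll = (
        tt.gammaln(r + x) - tt.gammaln(r) + r * tt.log(alpha)
        + tt.gammaln(a + b) + tt.gammaln(b + x)
        - tt.gammaln(b) - tt.gammaln(a + b + x)
    )
    log_alive = -(r + x) * tt.log(alpha + T)
    # The dropout branch only exists for customers with repeat purchases;
    # tt.maximum keeps the log argument positive when x == 0.
    log_dead = (
        tt.log(a) - tt.log(b + tt.maximum(x, 1.0) - 1.0)
        - (r + x) * tt.log(alpha + t_x)
    )
    # Numerically stable log(exp(log_alive) + exp(log_dead)).
    m = tt.maximum(log_alive, log_dead)
    logaddexp = m + tt.log(tt.exp(log_alive - m) + tt.exp(log_dead - m))
    ll = ll + tt.switch(x > 0, logaddexp, log_alive)

    pm.Potential("loglike", ll.sum())
    trace = pm.sample(2000, tune=2000, return_inferencedata=True)
```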
Join Zoom Meeting: https://us02web.zoom.us/j/81938221716

Meeting ID: 819 3822 1716

One tap mobile:
- +12532158782,,81938221716# US (Tacoma)
- +13462487799,,81938221716# US (Houston)

Dial by your location:
- +1 253 215 8782 US (Tacoma)
- +1 346 248 7799 US (Houston)
- +1 669 900 6833 US (San Jose)
- +1 301 715 8592 US (Washington DC)
- +1 312 626 6799 US (Chicago)
- +1 929 436 2866 US (New York)

Find your local number: https://us02web.zoom.us/u/kCp1rZoUe
Thanks @deepyaman, @juanitorduz, and everyone else for attending the Zoom call today. Here's a summary of what we discussed:
Identified Library Issues
- Codebase does not include type hints.
- Instability with `scipy.special.hyp2f1` when using `pandas` inputs, particularly with `GammaGammaFitter`.
- Lack of options for plotting and quantifying uncertainty.
- No standard-error estimation from the Hessian matrix during inference/optimization.
- RFM aggregations computationally prohibitive with large datasets.
- Plotting functions have extraneous dependencies on other methods in the library, limiting flexibility.
- Difficult to determine whether calibration and holdout datasets are equally distributed, since they can only be split by time period, which is an incomplete approach to model evaluation.
- MLE convergence not stable:
  - `autograd` dependency was deprecated two years ago.
  - Current log-likelihood formulations can cause optimizers to crash.
  - Current MLE penalizer assumptions are ill-suited for parameter estimation.
- Model assumptions are not being tested.
Development Priorities
- Update documentation to add the contents of this message, an updated contributor's guide, and links to research papers.
- Coveralls integration for test coverage (I'm working on this now).
- Merge the BetaGeo time-invariant covariates model PR submitted by @meremeev.
- Add type hints, and separate utility-method dependencies from plotting methods.
- `pymc` backend integration into the `BaseFitter` class. The current MLE approach will be replaced with the `find_MAP` function in `pymc4`, which is expected to be released Apr/May 2022 (see the sketch after this list).
- Expand model evaluations with the Gelman-Rubin statistic, posterior predictive checks, and other methods.
- While the `pymc4` overhaul is ongoing, be mindful of existing issues that have been identified, like the log-likelihood formulations and `scipy.special.hyp2f1` contributing to convergence instability, and add bug fixes whenever these problems arise.
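For the `find_MAP` item above, a minimal sketch of the intended pattern - shown with the current pymc3 API since `pymc4`'s interface isn't final, and the toy model is just a placeholder:

```python
import numpy as np
import pymc3 as pm

# Toy purchase-count data.
counts = np.array([0, 1, 3, 0, 2])

with pm.Model():
    # Gamma prior on a shared Poisson purchase rate.
    lam = pm.Gamma("lam", alpha=1.0, beta=1.0)
    pm.Poisson("purchases", mu=lam, observed=counts)
    # MAP point estimate in place of the current scipy-based MLE.
    map_estimate = pm.find_MAP()

print(map_estimate["lam"])
```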
Future Additions
- Support for hierarchical Bayesian models.
- Additional models.
- Poetry for packaging.
- Nox for testing across Python versions (Nox currently cannot be used with Poetry, so one of the two must be resolved).
- Distribute the RFM aggregation with `dask`.
- Stan backend integration (this would add considerable overhead to the project; if I don't get a lot of requests and PRs related to Stan, I will not pursue it).
Future work will continue in the fork I've created: https://github.com/ColtAllen/lifetimes
An alpha release of the successor library - rebranded as `btyd` - is now available for `pip` install:
https://github.com/ColtAllen/btyd
The `btyd` successor library is now in beta:
https://github.com/ColtAllen/btyd
The second beta release of the `btyd` successor library is now available for `pip` install:
https://github.com/ColtAllen/btyd
The third beta release of `btyd` is now available for `pip` install! This one includes a Bayesian variant of the Modified BG/NBD model, a few bug fixes, and some requested additions to the existing `lifetimes` models.
I've decided to merge efforts with the PyMC Labs team and work on the pymc-marketing project, which will become the premier solution for CLV modeling going forward. BTYD has been a solo project of mine ever since I forked this library, but this is now a community effort!
@CamDavidsonPilon, please update the README to reflect this, thank you.
Neat! Looks like a fun project!