lifetimes
Have any "successor libraries" emerged, as Cam suggested?
UPDATE: pymc-marketing will become the new successor to this library.
I know this post is nearly a year old, but I would be happy to collaborate with others on a successor library built in PyMC.
I've recently started working on a CLV project and already foresee the time-based splitting of calibration and holdout data as a considerable limitation. Random and/or stratified sampling to ensure the calibration and holdout data are equally distributed would be my priority (see the sketch below). PyMC's built-in statistical functions would also lend themselves well to this project, and model training could be distributed across GPUs to dramatically reduce training time.
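To sketch what I have in mind, here's a stratified random split using scikit-learn's `train_test_split` and a hypothetical RFM summary file - none of this is lifetimes API:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical RFM summary table with frequency/recency/T/monetary_value columns.
rfm = pd.read_csv("rfm_summary.csv")

# Bin purchase frequency so both splits share its distribution.
freq_bins = pd.qcut(rfm["frequency"], q=4, duplicates="drop")

# Stratified random split instead of a time-based calibration/holdout cutoff.
calibration, holdout = train_test_split(
    rfm, test_size=0.2, stratify=freq_bins, random_state=42
)
```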
I'm still proceeding with lifetimes as-is for the beta release of my CLV project, so I won't have much time to dedicate to a successor library until Mar 2022, but if anyone is interested, please respond to this issue.
@ColtAllen feel free to contact me
@ColtAllen, I am personally more interested in a TensorFlow Probability-based successor, having not worked with Pyro much, but I would be interested in assisting and seeing where there may be overlap.
I have been working, albeit slowly, on building a successor on Dask instead of Pandas. I see the challenge of doing CLV on millions of users and not being able to fit things in memory. The idea of Pyro sounds very compelling. How would you like to organize the project?
@shgidi @gpyga @rodrigorivera Want to plan a Zoom call to discuss this further? I’m in the Denver area, Mountain Standard Time (UTC-7:00). I have a draft prepared of the details I want to discuss, but I’ll provide an overview here and address your comments.
> @ColtAllen, I am personally more interested in a TensorFlow Probability-based successor, having not worked with Pyro much, but I would be interested in assisting and seeing where there may be overlap.
Pyro is to PyTorch what TFProb is to TensorFlow. If this project takes off, then supporting both libraries would be a great direction to go. I personally prefer Pyro because open-source is only as good as its supporting documentation. I started working with TFProb back in 2017, when it was still called Edward, but have since moved away from it because its vague yet verbose documentation - which even has a few broken links - created considerable friction in my projects:
https://www.tensorflow.org/probability/overview
The documentation for Pyro, on the other hand, is among the best I've ever seen for an open-source library:
https://docs.pyro.ai/en/stable/
Both packages are also relatively low-level. Base TF can be cumbersome to work with, whereas PyTorch was expressly written to have a syntax similar to NumPy:
https://pytorch.org/tutorials/beginner/basics/tensorqs_tutorial.html
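For example, here's a trivial side-by-side of the two tensor APIs (toy values only):

```python
import numpy as np
import torch

# The two APIs mirror each other almost operation-for-operation.
a = np.ones((2, 3)) * 2.0
t = torch.ones(2, 3) * 2.0

print(a.sum(axis=1))        # numpy: array([6., 6.])
print(t.sum(dim=1))         # torch: tensor([6., 6.])
print(torch.from_numpy(a))  # zero-copy bridge from numpy to torch
```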
Speaking of numpy:
https://examples.dask.org/array.html
> I have been working, albeit slowly, on building a successor on Dask instead of Pandas. I see the challenge of doing CLV on millions of users and not being able to fit things in memory.
Dask is basically a distributed drop-in replacement for NumPy and would be an excellent alternative for the RFM aggregations (rough sketch below). My current project has over 88 million transactions, so my team had to create a separate RFM feature store just to use lifetimes.
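For illustration, a rough sketch of an out-of-core RFM aggregation with `dask.dataframe` - the file name, column names, and aggregation spec are placeholders of mine, not anything from lifetimes:

```python
import dask.dataframe as dd

# Hypothetical transaction log too large to fit in memory.
transactions = dd.read_parquet("transactions.parquet")

# Lazily aggregate per customer, then compute out-of-core / across workers.
rfm = (
    transactions.groupby("customer_id")
    .agg({"date": ["count", "max"], "amount": "mean"})
    .compute()
)
```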
> The idea of Pyro sounds very compelling. How would you like to organize the project?
In the Zoom call, I want to discuss and reach common agreement on the following areas:
- Problems
- Goals
- Contributing
I've reviewed the GitHub issues for lifetimes in detail, and I'm sure we each have our own lists of problems to bring up, but let's not confuse issues with features we'd like to see added.
I like the OKR approach for setting goals (qualitative Objectives and measurable Key Results), but I'm not married to the methodology by any means. A good Objective would be to make lifetimes the premier open-source library for stochastic RFM and CLV modeling. The number of models supported, reduced training times and convergence-error rates, and growth in GitHub Stars and Watchers are all ways we could measure this.
Lastly, the documentation for lifetimes is quite good, but I want to review the contributor's guide in particular, make any desired changes, and ensure we're all in alignment before going full speed ahead with code development; it will make PRs go much more smoothly in the future.
After these preliminaries are out of the way, we can put a task list together and set up GitHub Project pages for each. Looking forward to working with you all!
@ColtAllen I would also be interested in collaborating on a successor library, and would love to join an upcoming call (if the kickoff you mentioned hasn't happened yet)!
We use `lifetimes` in our CVM toolkit at my current company, but I was looking into how we might get access to a wider variety of methods than it currently implements (I was looking at R libraries like `btydplus` and `CLVTools` for inspiration). Unlike the rest of you, I have no strong opinion on a backend thus far, although I have slightly more exposure to Dask than to the other alternatives.
> @shgidi @gpyga @rodrigorivera Want to plan a Zoom call to discuss this further? I’m in the Denver area, Mountain Standard Time (UTC-7:00). I have a draft prepared of the details I want to discuss, but I’ll provide an overview here and address your comments.
Absolutely. I am in Central European Time. Should we aim to have a call in the second or third week of March?
@rodrigorivera Awesome! How about either March 13th or 20th for the Zoom meeting? Due to time zone differences, I see this happening around noontime for those in the Americas, and in the evening for those in Europe.
@deepyaman Hope you can join! I've been looking at the `btydplus` and `CLVTools` R libraries as well, and am even considering `rpy2` (a Python API for R) as a band-aid for the MLE convergence issues I've been encountering in `lifetimes` so far.
March 13 works for me personally!
@ColtAllen I’ve used `rpy2` in the past to use an epidemiological modeling package that - at least at the time - had no reasonable Python equivalent. My intuition is to steer clear of it for a successor to `lifetimes`, since requiring an R runtime for a Python package ends up being very inconvenient/limiting from a production-deployment perspective (suddenly all the Docker images need to have R installed, etc.).
@deepyaman Great! I'll let @rodrigorivera pick the time since this will be happening at the very end of his day, and I'll post the Zoom link here for anyone to join.
Also, I have little interest in integrating `rpy2` into `lifetimes`; sorry for not clarifying that earlier. My director has R experience and floated the idea for our internal project deployment, but that's an excellent point you make about the added Dockerfile complexity. I'll be sure to bring it up.
If I had to pick another language to incorporate into `lifetimes`, it would be Stan, which `prophet` uses under the hood for MCMC inference of the hyperparameters:
https://github.com/facebook/prophet/blob/main/python/stan/unix/prophet.stan
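For what it's worth, driving a Stan program from Python is straightforward with `cmdstanpy`. A sketch, assuming a hypothetical `bgnbd.stan` file whose data block declares `N`, `x`, `t_x`, and `T`:

```python
from cmdstanpy import CmdStanModel

# Compile the (hypothetical) Stan program and run NUTS on toy RFM data.
model = CmdStanModel(stan_file="bgnbd.stan")
fit = model.sample(
    data={"N": 3, "x": [2, 0, 5], "t_x": [30, 0, 50], "T": [60, 40, 55]}
)
print(fit.summary())
```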
@deepyaman @rodrigorivera @gpyga @shgidi I’m pushing back this Zoom call because I’ve sent collaboration invites to others and want to give them the opportunity to join as well. If I don’t hear from any of them by St. Patrick’s Day, we can go forward with meeting on 20-Mar or any other Sunday you prefer.
I’ve been reviewing the choices of backend for a successor library, and now believe `pymc3` and/or `Stan` are the best options. I’ve found code implementations of the BG/NBD and Gamma-Gamma models in `pymc3` and `Stan`, respectively, and have sent collaboration invites to the creators.
`pymc3` has the cleanest, most Pythonic syntax of any statistical library I’ve worked with, but I stopped using it several years ago because it still used the deprecated Theano tensor library as a backend. However, `aesara` - the successor backend they’ve developed - seems quite mature now, and both `aesara` and `pymc3` have huge developer communities to reach out to for support. @CamDavidsonPilon himself has even written an eBook about `pymc3`; I do hope he’s able to join the Zoom call and/or assist in a technical advisory capacity.
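As a taste of that syntax, here's a minimal toy pymc3 model - a Beta-Bernoulli repeat-purchase model of my own, not one of the lifetimes models:

```python
import numpy as np
import pymc3 as pm

# Toy data: whether each of ten customers made a repeat purchase.
repeat = np.array([0, 1, 0, 0, 1, 1, 0, 0, 0, 1])

with pm.Model() as model:
    # Beta prior on the repeat-purchase probability.
    p = pm.Beta("p", alpha=1.0, beta=1.0)
    pm.Bernoulli("repeat", p=p, observed=repeat)
    trace = pm.sample(1000, tune=1000, return_inferencedata=True)
```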
Lastly, I've forked this repo and have invited you all to be collaborators:
https://github.com/ColtAllen/lifetimes
I haven’t done much yet aside from update the README, but I’ll be adding some new research paper links and making other minor documentation changes here shortly.
I appreciate the invitation to join the call and provide advice, but I don't think I would add much! I would like to express my excitement about a successor library being built with probabilistic programming tools - that was a future vision of mine for these RFM techniques. Best of luck, folks!
Zoom call is scheduled for Sunday, 27-Mar at 10 AM Mountain Daylight Time (UTC-6:00)
I've been receiving messages from other interested parties on LinkedIn, so I'm delaying the Zoom call by one more week to give others the chance to discover this discussion and join.
I've already started working on an MCMC implementation of the Beta-Geo model. MCMC has challenges of its own, but according to the paper below it has far fewer convergence issues than the current MLE approach, which will solve a lot of the problems people have with this library:
Worth the effort? Comparison of different MCMC algorithms for estimating the Pareto/NBD model
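To give a rough idea of the direction, here's a sketch of the BG/NBD log-likelihood - following the formulation in lifetimes' `BetaGeoFitter` - expressed as a `pm.Potential` in pymc3. The priors and toy data are assumptions of mine:

```python
import numpy as np
import pymc3 as pm
import theano.tensor as tt

# Toy RFM arrays: x = repeat purchases, t_x = recency, T = customer age.
x = np.array([2.0, 0.0, 5.0, 1.0])
t_x = np.array([30.0, 0.0, 50.0, 10.0])
T = np.array([60.0, 40.0, 55.0, 28.0])

with pm.Model() as bgnbd:
    # Weakly informative priors on the BG/NBD hyperparameters (an assumption).
    r = pm.HalfNormal("r", sigma=10.0)
    alpha = pm.HalfNormal("alpha", sigma=10.0)
    a = pm.HalfNormal("a", sigma=10.0)
    b = pm.HalfNormal("b", sigma=10.0)

    # Shared terms of the BG/NBD log-likelihood.
    ll = (
        tt.gammaln(r + x) - tt.gammaln(r) + r * tt.log(alpha)
        + tt.gammaln(a + b) + tt.gammaln(b + x)
        - tt.gammaln(b) - tt.gammaln(a + b + x)
    )
    log_alive = -(r + x) * tt.log(alpha + T)
    # The dropout branch only exists for customers with repeat purchases;
    # tt.maximum keeps the log argument positive when x == 0.
    log_dead = (
        tt.log(a) - tt.log(b + tt.maximum(x, 1.0) - 1.0)
        - (r + x) * tt.log(alpha + t_x)
    )
    # Numerically stable log(exp(log_alive) + exp(log_dead)).
    m = tt.maximum(log_alive, log_dead)
    logaddexp = m + tt.log(tt.exp(log_alive - m) + tt.exp(log_dead - m))
    ll = ll + tt.switch(x > 0, logaddexp, log_alive)

    pm.Potential("loglike", ll.sum())
    trace = pm.sample(2000, tune=2000, return_inferencedata=True)
```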
Join Zoom Meeting: https://us02web.zoom.us/j/81938221716

Meeting ID: 819 3822 1716

One tap mobile:
- +12532158782,,81938221716# US (Tacoma)
- +13462487799,,81938221716# US (Houston)

Dial by your location:
- +1 253 215 8782 US (Tacoma)
- +1 346 248 7799 US (Houston)
- +1 669 900 6833 US (San Jose)
- +1 301 715 8592 US (Washington DC)
- +1 312 626 6799 US (Chicago)
- +1 929 436 2866 US (New York)

Find your local number: https://us02web.zoom.us/u/kCp1rZoUe
Thanks @deepyaman, @juanitorduz, and everyone else for attending the Zoom call today. Here's a summary of what we discussed:
Identified Library Issues
- Codebase does not include type hints.
- Instability with `scipy.special.hyp2f1` when using `pandas` inputs, particularly with `GammaGammaFitter`.
- Lack of options for plotting and quantifying uncertainty.
- No standard-error estimation from the Hessian matrix during inference/optimization.
- RFM aggregations computationally prohibitive with large datasets.
- Plotting functions have extraneous dependencies on other methods in the library, limiting flexibility.
- Difficult to determine whether calibration and holdout datasets are equally distributed, since they can only be split by time period, which is an incomplete approach to model evaluation.
- MLE convergence not stable:
  - `autograd` dependency was deprecated two years ago.
  - Current log-likelihood formulations can cause optimizers to crash.
  - Current MLE penalizer assumptions are ill-suited for parameter estimation.
- Model assumptions are not being tested.
Development Priorities
- Update documentation to add the contents of this message, an updated contributor's guide, and links to research papers.
- Coveralls integration for test coverage (I'm working on this now).
- Merge the BetaGeo time-invariant covariates model PR submitted by @meremeev.
- Add type hints, and separate utility-method dependencies from plotting methods.
- `pymc` backend integration into the `BaseFitter` class. The current MLE approach will be replaced with the `find_MAP` function in `pymc4`, which is expected to be released Apr/May 2022 (see the sketch after this list).
- Expand model evaluations with the Gelman-Rubin statistic, posterior predictive checks, and other methods.
- While the `pymc4` overhaul is ongoing, be mindful of existing issues that have been identified, like the log-likelihood formulations and `scipy.special.hyp2f1` contributing to convergence instability, and add bug fixes whenever these problems arise.
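For the `find_MAP` item above, a minimal sketch of the intended pattern - shown with the current pymc3 API since `pymc4`'s interface isn't final, and the toy model is just a placeholder:

```python
import numpy as np
import pymc3 as pm

# Toy purchase-count data.
counts = np.array([0, 1, 3, 0, 2])

with pm.Model():
    # Gamma prior on a shared Poisson purchase rate.
    lam = pm.Gamma("lam", alpha=1.0, beta=1.0)
    pm.Poisson("purchases", mu=lam, observed=counts)
    # MAP point estimate in place of the current scipy-based MLE.
    map_estimate = pm.find_MAP()

print(map_estimate["lam"])
```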
Future Additions
- Support for hierarchical Bayesian models.
- Additional models.
- Poetry for packaging.
- Nox for testing across Python versions (Nox currently cannot be used with Poetry, so one of the two must be resolved).
- Distribute the RFM aggregation with `dask`.
- Stan backend integration (this would add considerable overhead to the project; if I don't get a lot of requests and PRs related to Stan, I will not pursue it).
Future work will continue in the fork I've created: https://github.com/ColtAllen/lifetimes
An alpha release of the successor library - rebranded as `btyd` - is now available for `pip` install:
https://github.com/ColtAllen/btyd
The `btyd` successor library is now in beta:
https://github.com/ColtAllen/btyd
The second beta release of the `btyd` successor library is now available for `pip` install:
https://github.com/ColtAllen/btyd
The third beta release of `btyd` is now available for `pip` install! This one includes a Bayesian variant of the Modified BG/NBD model, a few bug fixes, and some requested additions to the existing `lifetimes` models.
I've decided to merge efforts with the PyMC Labs team and work on the pymc-marketing project, which will become the premier solution for CLV modeling going forward. BTYD has been a solo project of mine ever since I forked this library, but this is now a community effort!
@CamDavidsonPilon, please update the README to reflect this, thank you.
Neat! Looks like a fun project!