pymc-marketing
Option to exclude fit data when saving models
Related to https://github.com/pymc-labs/pymc-marketing/issues/1259
Data used for model fitting is included in the saved model file, but for CLV models this increases the file size dramatically. An option to exclude this data would be advantageous, particularly when logging to MLFlow. Note that when loading a model without fit data, some plotting functionality will be lost and data will need to be specified for the predictive methods, but these concerns are minor.
This issue is more relevant to CLV models, but I don't see why MMMs can't be supported as well.
The load logic of MMM would break as it rebuilds the model upon load. Does clv not do that?
It seems this would break CLV loading as well, since the CLV model is built upon loading the data.
https://github.com/pymc-labs/pymc-marketing/blob/7dfa9558b307ee8618fcbc1c5b92b062c1a2dfa4/pymc_marketing/clv/models/basic.py#L168-L197
What behavior do you expect here?
_build_with_idata will require modification:
https://github.com/pymc-labs/pymc-marketing/blob/7dfa9558b307ee8618fcbc1c5b92b062c1a2dfa4/pymc_marketing/clv/models/basic.py#L200-L201
A conditional can be added for fit data, and if unavailable, it should be possible to instantiate a model with an empty dataframe if the column names are correct. Will need to test this for covariates.
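A minimal sketch of that conditional, written against plain dictionaries rather than the real `InferenceData` object. The function name `recover_fit_data`, the dict-of-groups layout, and the column list are assumptions for illustration, not pymc-marketing code; the empty-column fallback stands in for the empty dataframe mentioned above.

```python
# Hypothetical column set; the actual required columns depend on the CLV model.
REQUIRED_COLUMNS = ("customer_id", "frequency", "recency", "T")


def recover_fit_data(idata_groups, required_columns=REQUIRED_COLUMNS):
    """Return fit data as a column -> list mapping.

    `idata_groups` is a plain dict standing in for the saved groups.
    If no "fit_data" group was saved, fall back to empty columns with
    the expected names so the model structure can still be declared.
    """
    fit_data = idata_groups.get("fit_data")
    if fit_data is None:
        # No data was saved: empty placeholder with the correct column names.
        return {col: [] for col in required_columns}

    missing = [col for col in required_columns if col not in fit_data]
    if missing:
        raise ValueError(f"fit_data is missing required columns: {missing}")
    return {col: list(fit_data[col]) for col in required_columns}
```

Whether a model built on an empty placeholder behaves correctly, especially with covariates, is exactly what would still need testing.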
May be best to do this in separate PRs for CLV and MMM since the load methods are different.
If the posterior is intact but the model is not built (because there is no data set to rebuild it from), would any of the methods work? I am confused.
@ColtAllen, how do you expect this to work? There needs to be some data to build the model. Would there be a subset of the training data in order to handle this build?
A loaded model can still be initialized without idata.fit_data, but in that case it should raise a UserWarning prompting the user to call build_model again.
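Roughly, the warning path could look like the sketch below. It uses a plain dict in place of the saved `InferenceData`, and `load_without_fit_data` is a hypothetical helper name, not an existing pymc-marketing function.

```python
import warnings


def load_without_fit_data(idata_groups):
    """Restore what can be restored and warn if the model cannot be rebuilt.

    `idata_groups` is a plain dict standing in for the saved groups.
    """
    has_fit_data = "fit_data" in idata_groups
    if not has_fit_data:
        warnings.warn(
            "No fit_data found in the saved file; call build_model with data "
            "before using plotting or predictive methods.",
            UserWarning,
            stacklevel=2,
        )
    # The posterior is kept either way; only the model build is deferred.
    return {
        "posterior": idata_groups.get("posterior"),
        "model_built": has_fit_data,
    }
```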
After doing some testing, I've decided the best approach is to modify the CLV API so data can be passed into build_model and fit like all other models. It's the only way to preserve plotting and PPC functionality in a loaded model, and it would also clean up the ModelBuilder internals. Among other things, idata.fit_data.to_dataframe() is being called twice in the loader method, and the only way around this is to stop passing data into __init__.
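To make the proposed API shape concrete, here is a toy sketch in which data goes into `build_model` and `fit` instead of `__init__`. `SketchCLVModel` and its `load` signature are hypothetical stand-ins, not the actual `ModelBuilder` or CLV classes, and the "posterior" is a placeholder dict rather than a real sampling result.

```python
class SketchCLVModel:
    """Hypothetical stand-in showing data passed to build_model/fit."""

    def __init__(self, model_config=None):
        # No data here: the constructor only holds configuration.
        self.model_config = model_config or {}
        self.data = None
        self.posterior = None
        self.is_built = False

    def build_model(self, data):
        # Data arrives at build time, so a loaded model can be rebuilt
        # with whatever data the caller supplies.
        self.data = data
        self.is_built = True

    def fit(self, data):
        self.build_model(data)
        # Placeholder for sampling; a real model would run inference here.
        self.posterior = {"n_customers": len(data["frequency"])}
        return self.posterior

    @classmethod
    def load(cls, saved, data=None):
        # `saved` is a plain dict standing in for the saved file contents.
        model = cls(saved.get("model_config"))
        model.posterior = saved.get("posterior")
        if data is not None:
            model.build_model(data)
        return model
```

With this shape, a file saved without fit data still loads with its posterior, and the caller decides when to supply data:

```python
saved = {"model_config": {}, "posterior": {"n_customers": 2}}
model = SketchCLVModel.load(saved)          # posterior only, not built
model.build_model({"frequency": [3, 0]})    # rebuild once data is available
```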