Option to exclude fit data when saving models

Open ColtAllen opened this issue 10 months ago • 7 comments

Related to https://github.com/pymc-labs/pymc-marketing/issues/1259

Data used for model fitting is included in the saved model file, but for CLV models this increases the file size dramatically. An option to exclude this data would be advantageous, particularly when logging to MLflow. Note that when a model is loaded without fit data, some plotting functionality will be lost and data will need to be passed explicitly to the predictive methods, but these concerns are minor.

This issue is more relevant to CLV models, but I don't see why MMMs can't be supported as well.

ColtAllen avatar Jan 08 '25 18:01 ColtAllen

The load logic of MMM would break as it rebuilds the model upon load. Does clv not do that?

williambdean avatar Jan 08 '25 18:01 williambdean

It seems that this would break CLV loading as well, since the CLV model is built upon loading the data.

https://github.com/pymc-labs/pymc-marketing/blob/7dfa9558b307ee8618fcbc1c5b92b062c1a2dfa4/pymc_marketing/clv/models/basic.py#L168-L197

What behavior do you expect here?

williambdean avatar Jan 08 '25 21:01 williambdean

_build_with_idata will require modification:

https://github.com/pymc-labs/pymc-marketing/blob/7dfa9558b307ee8618fcbc1c5b92b062c1a2dfa4/pymc_marketing/clv/models/basic.py#L200-L201

A conditional can be added for fit data, and if unavailable, it should be possible to instantiate a model with an empty dataframe if the column names are correct. Will need to test this for covariates.
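A minimal sketch of the conditional described above, assuming pandas. It does not use the actual pymc-marketing loader: the `groups` dict is a stand-in for the saved inference data, and `data_for_rebuild` and `REQUIRED_COLUMNS` are hypothetical names chosen for illustration. Whether a zero-row frame is enough to rebuild a model with covariates is exactly the part that still needs testing.

```python
import pandas as pd

# Assumed column names for illustration; the real set depends on the model.
REQUIRED_COLUMNS = ["customer_id", "frequency", "recency", "T"]


def data_for_rebuild(groups: dict) -> pd.DataFrame:
    """Return the stored fit data, or an empty frame with the expected columns.

    `groups` is a plain dict mapping group names to DataFrames, standing in
    for the groups of a saved inference data object.
    """
    if "fit_data" in groups:
        return groups["fit_data"]
    # No fit data was saved: fall back to a zero-row frame so that
    # column lookups still succeed during the rebuild.
    return pd.DataFrame(columns=REQUIRED_COLUMNS)


empty = data_for_rebuild({})
print(list(empty.columns), len(empty))
```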

ColtAllen avatar Jan 09 '25 02:01 ColtAllen

> The load logic of MMM would break as it rebuilds the model upon load. Does clv not do that?

May be best to do this in separate PRs for CLV and MMM since the load methods are different.

ColtAllen avatar Jan 09 '25 02:01 ColtAllen

If the posterior is intact but the data is not built (because there is no data set to rebuild), would any of the methods work? I am confused.

williambdean avatar Jan 18 '25 19:01 williambdean

@ColtAllen , how do you expect this to work? There needs to be some data to build the model. Would there be a subset of the training data in order to handle this build?

williambdean avatar Feb 04 '25 14:02 williambdean

> @ColtAllen , how do you expect this to work? There needs to be some data to build the model. Would there be a subset of the training data in order to handle this build?

A loaded model can still be initialized without idata.fit_data, but in such a case it should raise a UserWarning to call build_model again.

After doing some testing, I've decided the best approach is to modify the CLV API so data can be passed into build_model and fit like all other models. It's the only way to preserve plotting and PPC functionality in a loaded model, and it would also clean up the ModelBuilder internals. Among other things, idata.fit_data.to_dataframe() is being called twice in the loader method, and the only way around this is to stop passing data into __init__.
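A rough sketch of the behavior proposed in this comment, using only the standard library. `SketchModel` is a hypothetical stand-in rather than the pymc-marketing ModelBuilder, and the `saved` dict stands in for a saved model file. It only illustrates the shape of the idea: data is passed to build_model instead of __init__, and loading without fit data emits a UserWarning.

```python
import warnings


class SketchModel:
    """Hypothetical stand-in, not the pymc-marketing ModelBuilder."""

    def __init__(self, model_config=None):
        # No data is passed here; it arrives later via build_model.
        self.model_config = model_config or {}
        self.data = None

    def build_model(self, data):
        # Store the data the model is built on.
        self.data = data
        return self

    @classmethod
    def load(cls, saved: dict):
        """Recreate a model from `saved`, a dict standing in for a model file."""
        model = cls(saved.get("model_config"))
        fit_data = saved.get("fit_data")
        if fit_data is None:
            # The model was saved without fit data: initialize anyway,
            # but tell the user that a rebuild is required.
            warnings.warn(
                "No fit data found in the saved model; call build_model(data) "
                "before plotting or predicting.",
                UserWarning,
                stacklevel=2,
            )
        else:
            model.build_model(fit_data)
        return model
```

With this shape, a model saved without fit data still loads, and the caller restores plotting and predictive functionality by calling build_model with their own data.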

ColtAllen avatar Aug 05 '25 07:08 ColtAllen