bambi
Keeping track of and saving bambi-estimated models
Hi bambi community! :deer:
This issue is mostly meant to open a discussion on the topic (feel free to suggest migrating it if this isn't the right place).
TL;DR: How do you keep track of the models that you run, including their properties (e.g., which EVs were included, number of samples, etc.)?
I often run large batches of models. For example, I recently wanted to try ways of doing variable selection using all possible permutations of variables. What I was missing (and couldn't find online) was a way to keep track of the models that I estimated and save the samples in some accessible format. Additionally, I wanted a tool that would check whether I had already estimated a given model and, if so, just load it: a sort of model-caching system. Lastly, a nice feature would be a model database with links to the actual data that could be shared, so that anyone with permission could reproduce the results without needing to re-run the models (e.g., by pulling the data using datalad).
Since I couldn't find anything of the sort for pymc/bambi models, I coded a few functions and packaged them here. I have been using the package successfully for a while. It works for me because I added the features that I find useful. So far, it just maintains a JSON file with entries for estimated models (which can be neatly explored with a JSON viewer) and saves the actual estimated samples in pickle format.
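To make the caching idea concrete, here is a minimal sketch of the general pattern (a JSON index plus pickled results), not the package's actual API. The names `model_key`, `fit_or_load`, `models.json`, and the `spec` dictionary layout are all hypothetical, and `fit_fn` stands in for whatever function actually builds and samples the bambi/pymc model.

```python
import hashlib
import json
import pickle
from pathlib import Path


def model_key(spec: dict) -> str:
    """Return a stable short hash for a model specification (hypothetical helper)."""
    # sort_keys makes the hash independent of dictionary key order
    canonical = json.dumps(spec, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


def fit_or_load(spec: dict, fit_fn, cache_dir) -> tuple:
    """Load cached results for `spec` if present, otherwise fit and store them.

    Returns a (results, was_cached) tuple.
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    index_path = cache_dir / "models.json"  # assumed index file name
    index = json.loads(index_path.read_text()) if index_path.exists() else {}

    key = model_key(spec)
    if key in index:
        # Model was estimated before: load the pickled samples instead of refitting
        with open(cache_dir / index[key]["samples_file"], "rb") as fh:
            return pickle.load(fh), True

    # Not seen before: fit, pickle the results, and record the entry in the index
    results = fit_fn(spec)
    samples_file = f"{key}.pkl"
    with open(cache_dir / samples_file, "wb") as fh:
        pickle.dump(results, fh)
    index[key] = {"spec": spec, "samples_file": samples_file}
    index_path.write_text(json.dumps(index, indent=2))
    return results, False
```

Hashing the whole specification (formula, number of draws, chains, and so on) means that changing any setting produces a new entry rather than silently reusing old samples. One caveat if the database is shared: pickle files should only be loaded from sources you trust.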
I guess I am wondering how others do this, and whether the package above is something useful that would be worth putting more time into.
Responses much appreciated!
This is definitely a very interesting discussion, and congrats on the library you wrote.
Saving and loading Bambi models and their associated data is one thing we definitely need to improve. The problem doesn't involve only Bambi, though; it also involves how you save and load a PyMC model.
I'm also curious about whether other people face the same problem as you and how they solve it.
Thanks for opening the discussion!
Could you elaborate a bit more on what you'd want saved and loaded? Some items could be:
- Model definition e.g. the model string
- Prior specifications?
- Traces diagnostics
Is there anything else?
@tomicapretto thanks :) Yes, agreed. I started with bambi since that's what I needed, but I have used this for pymc models as well.
@canyon289 Definitely - here are some ideas I can think of atm:
- information about estimation (samples, chains, sampler)
- model string, but also anything useful re the specific class of models (e.g., in linear models the terms separately, what's the DV etc)
- data version used (e.g. datalad/git-annex commit hash)
- for specific classes of models, breakdown of useful information (e.g. in linear models
- integration with datalad (information re data storage)
- the DB can be local, but it could also allow for a remote one
- the DB should allow for hierarchy of models (i.e., model families; that's why I chose json)
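
As a rough illustration of how the items above could fit into one JSON entry, here is a sketch using only the standard library. All names (`ModelRecord`, `split_formula`, `register`, and the field names) are hypothetical, the formula splitting is deliberately naive, and the data version is left as a placeholder rather than a real commit hash.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class ModelRecord:
    formula: str       # model string
    response: str      # the DV
    terms: list        # right-hand-side terms, stored separately
    draws: int         # estimation info
    chains: int
    sampler: str
    data_version: str  # e.g. a datalad/git-annex commit hash
    samples_uri: str   # local path or remote location of the samples


def split_formula(formula: str) -> tuple:
    """Naively split 'y ~ a + b' into ('y', ['a', 'b']).

    Only splits on '~' and '+'; it does not parse the formula properly.
    """
    response, rhs = formula.split("~", 1)
    terms = [term.strip() for term in rhs.split("+")]
    return response.strip(), terms


def register(db: dict, family: str, name: str, record: ModelRecord) -> dict:
    """Store a record under its model family, giving a two-level hierarchy."""
    db.setdefault(family, {})[name] = asdict(record)
    return db


formula = "y ~ x + z"
response, terms = split_formula(formula)
record = ModelRecord(
    formula=formula,
    response=response,
    terms=terms,
    draws=1000,
    chains=4,
    sampler="NUTS",
    data_version="<commit-hash>",      # placeholder, not a real hash
    samples_uri="samples/linear_m1.pkl",  # hypothetical path
)
db = register({}, family="linear", name="m1", record=record)
print(json.dumps(db, indent=2))
```

Because the nested dictionary serializes directly to JSON, the same structure could live in a local file or behind a remote store without changing the record layout.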