patsy
Make DesignMatrixBuilders pickleable and saveable
Use cases:
- It should be possible to pickle a DesignMatrixBuilder (and/or DesignInfo, same issue)
- Checking if two designs are the same: this comes up for rERPy -- it's only valid to form a grand average across multiple analyses if the underlying regressions were the same. In particular it would be good to be able to check for subtle gotchas like use of center(...) with different means across the different analyses.
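To make the center(...) gotcha concrete, here is a minimal sketch. The Center class and the design dicts are hypothetical stand-ins invented for illustration, not patsy's own objects; the point is only that two analyses can share a formula string and still disagree once the memorized state is compared.

```python
from statistics import fmean


class Center:
    """Hypothetical stand-in for a stateful centering transform (not patsy's class)."""

    def __init__(self):
        self.mean = None

    def memorize(self, values):
        # Remember the mean of the data this analysis was fitted on.
        self.mean = fmean(values)

    def transform(self, values):
        # Subtract the memorized mean, not the mean of the new data.
        return [v - self.mean for v in values]


a, b = Center(), Center()
a.memorize([1.0, 2.0, 3.0])  # memorized mean is 2.0
b.memorize([2.0, 3.0, 4.0])  # memorized mean is 3.0

# Hypothetical "design" records: formula text plus the memorized state.
design_a = {"formula": "y ~ center(x)", "center_mean": a.mean}
design_b = {"formula": "y ~ center(x)", "center_mean": b.mean}

formulas_match = design_a["formula"] == design_b["formula"]
designs_match = design_a == design_b
```

Comparing only the formula text would call these two designs identical; comparing the memorized state as well shows that they are not.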
The easy part of this is reviewing the inner structure of DesignMatrixBuilder (column builders and all that) to make sure it's sensible, and similarly for factor state dicts.
The more complicated part is capturing the evaluation environment in a reasonable way.
Precondition: #25
Hi there,
I was wondering if there is any news regarding this issue. I also saw some discussion in https://groups.google.com/forum/#!topic/pydata/kcy79nrcFf4. Are there any existing workarounds for this issue?
Many thanks in advance!
Hey folks, The dill module (https://pypi.python.org/pypi/dill) is able to pickle DesignMatrixBuilders. Just tried it!
Really? That's bad!
The problem isn't so much making it work at all, it's making it work in such a way that a DesignMatrixBuilder pickled with patsy version X will unpickle correctly on patsy version Y. If you use dill now it will probably break later. (We should just have an error message, but I was lazy about adding this because pickle didn't really work anyway.)
Also, the way dill "works" at the moment is basically to pickle the whole universe just accidentally (b/c it will pickle every local variable that happens to be sitting around in the environment where you used patsy, which may well include your gigabyte sized dataset or whatever).
Nathaniel J. Smith -- http://vorpus.org
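A rough illustration of the "pickle the whole universe" problem, using only the standard library. The two capture functions and the namespace dict are hypothetical and only mimic the idea of an object that holds on to its evaluation environment; they say nothing about how patsy or dill are actually implemented.

```python
import pickle


def capture_everything(formula, namespace):
    """Hypothetical: keep a reference to the whole calling namespace."""
    return {"formula": formula, "namespace": namespace}


def capture_needed(formula, namespace, used_names):
    """Hypothetical: keep only the names the formula actually refers to."""
    return {
        "formula": formula,
        "namespace": {name: namespace[name] for name in used_names},
    }


# The formula only needs `scale`, but a large unrelated variable sits nearby.
namespace = {"scale": 2.0, "big_dataset": list(range(100_000))}
formula = "y ~ I(x * scale)"

size_everything = len(pickle.dumps(capture_everything(formula, namespace)))
size_needed = len(pickle.dumps(capture_needed(formula, namespace, ["scale"])))
print(size_everything, size_needed)
```

The first payload carries the unrelated dataset along with it; the second stays small because only the referenced name is kept.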
Yeah. Don't use dill to pickle v0.3 DesignMatrixBuilder objects. Patsy v0.4 will support pickling. (The harder bits are done.)
@njsmith I guess now that #25 is taken care of, the remaining bit is adding __setstate__ and friends to the right classes, no? Or do you see more to it? Which classes should that include? Just DesignMatrixBuilder? DesignInfo also? Or more? I'll do a first pass at a branch to solve this (and submit a pull request) once I have a bit of a better idea of what you have in mind.
Created pull request #67 to work on this.
Hey folks (particularly @chrish42), how's the progress on this issue? I saw PR #67 is still open and has been stalled for a while. How much work is left here for this feature to be functional?
The remaining step is to write unit tests for the serialization objects, to make sure that patsy doesn't (unknowingly) break support for formulas, etc. pickled with past versions. I've been kept busy with other things, but my goal is to get this finished before PyCon 2016, so I'm starting work on this piece in the coming weekends.
@chrish42: that would be great! Of course feel free to ask for help as well if you are stalled -- maybe @alexdamour wants to help, for example ;-)
I'm using the PyCon sprints to start working on this again. As a first step, I'm cleaning up the description of pull request #67 to have an actual list of tasks that must be done to close this. My next step is to update the pull request with enough code so people can see what the approach would look like.
@chrish42 hey guys, wondering if there's any new update on this front? thanks!
Sure. I've had a very productive sprint at PyCon. I know the "0 of 11 tasks complete" hasn't moved, but if you go look at the "code" tab of the pull request, you'll now see a pretty fleshed out testing framework for pickling. Once @njsmith is happy with that part, I can start implementing __getstate__ and __setstate__ for all the patsy objects that we want to support saving to disk (very easy to do) and adding a bunch of pickling testcases (easy too, with a proper framework for it).
If you want to follow the progress, have a look at the pull request, as this bug report will stay pretty quiet until we close it.
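For readers wondering what versioned __getstate__ / __setstate__ hooks might look like, here is a minimal sketch. SavedDesign, STATE_VERSION and the state layout are all hypothetical and are not taken from the pull request; the sketch only shows the general pattern of writing an explicit version into the saved state and refusing payloads it does not recognize.

```python
import pickle

STATE_VERSION = 1  # hypothetical schema version, bumped when the layout changes


class SavedDesign:
    """Hypothetical stand-in for an object such as DesignInfo."""

    def __init__(self, column_names, factor_states):
        self.column_names = list(column_names)
        self.factor_states = dict(factor_states)

    def __getstate__(self):
        # Emit an explicit, versioned dict instead of the raw __dict__.
        return {
            "version": STATE_VERSION,
            "column_names": self.column_names,
            "factor_states": self.factor_states,
        }

    def __setstate__(self, state):
        version = state.get("version")
        if version != STATE_VERSION:
            raise ValueError("unsupported saved-design version: %r" % (version,))
        self.column_names = state["column_names"]
        self.factor_states = state["factor_states"]


original = SavedDesign(["Intercept", "center(x)"], {"center(x)": {"mean": 2.0}})

# Drive the hooks by hand; pickle.dumps / pickle.loads call the same two methods.
payload = pickle.dumps(original.__getstate__())
restored = SavedDesign.__new__(SavedDesign)
restored.__setstate__(pickle.loads(payload))

# A payload written by an unknown future version is rejected rather than misread.
future_state = dict(original.__getstate__(), version=STATE_VERSION + 1)
try:
    SavedDesign.__new__(SavedDesign).__setstate__(future_state)
    rejected = False
except ValueError:
    rejected = True
```

Keeping the saved state as a plain dict with a version field is one way to let a later release either migrate old payloads or fail loudly, instead of silently unpickling into the wrong shape.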
pleaseee fix this issue for the love of god !
NotImplementedError: Sorry, pickling not yet supported. See https://github.com/pydata/patsy/issues/26 if you want to help.
+1 Also need to pickle
+1
For anyone interested, I've made a fork of patsy and merged this branch into the fork. I needed to make changes to the test cases so they worked properly for me, but this is fully working for me and we are using it in production.
I'm continuing to follow this thread so that we can switch back to using the patsy main repository once the finalized solution gets merged in.
@christang I tried your branch, but it does not work for me. Can you confirm that this is supposed to work:
import pickle
from patsy import ModelDesc, Term, LookupFactor
response_term = [Term([LookupFactor('test')])]
# pickle writes bytes, so the file must be opened in binary mode
with open('test_pickle.pkl', 'wb') as f:
    pickle.dump(response_term, f)
I still get the NotImplementedError.
@saroele Thanks for the note. I believe this branch only adds support for design matrix/info so it may be your other objects still remain without pickling support. I can confirm that the code does not work for me.
@christang, can you help me out with this, please? I'm currently getting a NotImplementedError even though I believe I'm doing the pickling right. Any advice would be greatly appreciated.
y, X = dmatrices(formula_like, df_model, return_type="dataframe")
with open(models_path + filename, 'wb') as file:
    pickle.dump(X.design_info, file)
This is really needed, guys. What's blocking this implementation?
@bertomartin Patsy maintenance is done on a purely-volunteer basis, and I haven't really had time to work on it (or even review PRs) in several years now. If someone needs this and has funding to spend on it, we could talk about some kind of consulting contract...
@njsmith thanks for the great work so far. Ok, I'll take a stab at it.
@bertomartin any news?
I have also tried looking into it:
import h5py

def save_patsy(patsy_step, filename):
    """Save the coefficients of a linear model into a .h5 file."""
    with h5py.File(filename, 'w') as hf:
        hf.create_dataset("design_info", data=patsy_step.design_info_)

def load_coefficients(patsy_step, filename):
    """Attach the saved coefficients to a linear model."""
    with h5py.File(filename, 'r') as hf:
        design_info = hf['design_info'][:]
    patsy_step.design_info_ = design_info

save_patsy(pipe['patsy'], "clf.h5")
Perhaps something simple like this?
However, it's still not working.
Hi @petrhrobar. I recommend you check out formulaic if you want support for pickling.
Here is a partial solution.
Before you first import patsy:
def fixed_factorinfo_repr(self, p, cycle):
    assert not cycle
    kwlist = [("factor", self.factor),
              ("type", self.type),
              ("state", self.state)
              ]
    if self.type == "numerical":
        kwlist.append(("num_columns", self.num_columns))
    else:
        kwlist.append(("categories", self.categories))
    patsy.util.repr_pretty_impl(p, self, [], kwlist)

def fake_evalenvironment_repr(self):
    return "EvalEnvironment([])"

import patsy
patsy.FactorInfo._repr_pretty_ = fixed_factorinfo_repr
patsy.EvalEnvironment.__repr__ = fake_evalenvironment_repr
Then, to "serialize" a DesignInfo, do:
serialized_design_info = repr(my_design_info)
To "deserialize" a DesignInfo, do:
from collections import OrderedDict
from patsy import DesignInfo, EvalFactor, Term, SubtermInfo, ContrastMatrix
from numpy import array
design_info_instance = eval(serialized_design_info)
To get a design matrix from a design info, you can use the function patsy.build_design_matrices([design_info], X, return_type="matrix")
Depending on your usage of patsy, there may be other __repr__ methods you may need to monkey-patch like I did here. Obviously the implementation of __repr__ for EvalEnvironment is going to be insufficient if you reference environment variables in your formulas. None of this works if patsy gets imported before the monkey-patching happens. Buyer beware.
The author of this library deserves major credit for almost completely implementing __repr__, which is what made this workaround possible.