Make DesignMatrixBuilders pickleable and saveable

Open njsmith opened this issue 10 years ago • 24 comments

Use cases:

  • It should be possible to pickle a DesignMatrixBuilder (and/or DesignInfo, same issue)
  • Checking if two designs are the same: this comes up for rERPy -- it's only valid to form a grand average across multiple analyses if the underlying regressions were the same. In particular it would be good to be able to check for subtle gotchas like use of center(...) with different means across the different analyses.

The easy part of this is reviewing the inner structure of DesignMatrixBuilder (column builders and all that) to make sure it's sensible, and similarly for factor state dicts.

The more complicated part is capturing the evaluation environment in a reasonable way.

Precondition: #25

njsmith avatar Oct 06 '13 13:10 njsmith

Hi there,

I was wondering if there is any news regarding this issue. I also saw some discussion in: https://groups.google.com/forum/#!topic/pydata/kcy79nrcFf4 Are there any existing workarounds for this issue?

Many thanks in advance!

egrublyte avatar Oct 23 '14 09:10 egrublyte

Hey folks, The dill module (https://pypi.python.org/pypi/dill) is able to pickle DesignMatrixBuilders. Just tried it!

ghost avatar Apr 15 '15 19:04 ghost

Really? That's bad!

The problem isn't so much making it work at all, it's making it work in such a way that a DesignMatrixBuilder pickled with patsy version X will unpickle correctly on patsy version Y. If you use dill now it will probably break later. (We should just have an error message, but I was lazy about adding this because pickle didn't really work anyway.)

Also, the way dill "works" at the moment is basically to pickle the whole universe just accidentally (b/c it will pickle every local variable that happens to be sitting around in the environment where you used patsy, which may well include your gigabyte sized dataset or whatever).
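The "pickle the whole universe" problem can be illustrated with a minimal, self-contained sketch. Nothing below is patsy or dill code: the two helper functions and the dictionary layout are hypothetical stand-ins, assumed only for illustration. They show how keeping a reference to every local variable drags unrelated data into the pickle, and how storing only the names a formula uses avoids that.

```python
import pickle


def capture_everything():
    big_dataset = list(range(100_000))  # unrelated to the formula
    scale = 2.0                         # the only name the formula uses
    # Hypothetical "builder" state that keeps every local variable around,
    # so big_dataset ends up inside the captured environment too.
    return {"formula": "y ~ I(x * scale)", "env": dict(locals())}


def capture_subset():
    big_dataset = list(range(100_000))
    scale = 2.0
    # Keep only the names the formula actually references.
    return {"formula": "y ~ I(x * scale)", "env": {"scale": scale}}


size_everything = len(pickle.dumps(capture_everything()))
size_subset = len(pickle.dumps(capture_subset()))
print(size_everything, size_subset)
```

The first size is dominated by the dataset that merely happened to be in scope, while the second stays tiny.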

njsmith avatar Apr 15 '15 19:04 njsmith

Yeah. Don't use dill to pickle v0.3 DesignMatrixBuilder objects. Patsy v0.4 will support pickling. (The harder bits are done.)

@njsmith I guess now that #25 is taken care of, the remaining bit is adding __setstate__ and friends to the right classes, no? Or do you see more to it? Which classes should that include? Just DesignMatrixBuilder? DesignInfo also? Or more? I'll do a first pass at a branch to solve this (and submit a pull request) once I have a bit of a better idea of what you have in mind.

chrish42 avatar May 20 '15 02:05 chrish42

Created pull request #67 to work on this.

chrish42 avatar May 21 '15 02:05 chrish42

Hey folks (particularly @chrish42), how's the progress on this issue? I saw PR #67 is still open and has been stalled for a while. How much work is left here for this feature to be functional?

alexdamour avatar Mar 15 '16 20:03 alexdamour

The remaining step is to write unit tests for the serialization objects, to make sure that patsy doesn't (unknowingly) break support for formulas, etc. pickled with past versions. I've been kept busy with other things, but my goal is to get this finished before PyCon 2016, so I'm starting work on this piece in the coming weekends.

chrish42 avatar Mar 16 '16 02:03 chrish42

@chrish42: that would be great! Of course feel free to ask for help as well if you are stalled -- maybe @alexdamour wants to help, for example ;-)

njsmith avatar Mar 16 '16 04:03 njsmith

I'm using the PyCon sprints to start working on this again. As a first step, I'm cleaning up the description of pull request #67 to have an actual list of tasks that must be done to close this. My next step is to update the pull request with enough code so people can see what the approach would look like.

chrish42 avatar Jun 04 '16 00:06 chrish42

@chrish42 hey guys, wondering if there's any new update on this front? thanks!

yongcho822 avatar Jun 08 '16 21:06 yongcho822

Sure. I've had a very productive sprint at PyCon. I know the "0 of 11 tasks complete" hasn't moved, but if you go look at the "code" tab of the pull request, you'll now see a pretty fleshed out testing framework for pickling. Once @njsmith is happy with that part, I can start implementing __getstate__ and __setstate__ for all the patsy objects that we want to support saving to disk (very easy to do) and adding a bunch of pickling testcases (easy too, with a proper framework for it).

If you want to follow the progress, have a look at the pull request, as this bug report will stay pretty quiet until we close it.
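For readers unfamiliar with the approach, here is a rough sketch of what version-aware `__getstate__` and `__setstate__` methods can look like. It is not the code from the pull request: `ToyDesignInfo`, its fields, and the version numbers are all hypothetical, chosen only to show how an explicit version tag lets a newer release upgrade state written by an older one.

```python
import pickle


class ToyDesignInfo:
    """Hypothetical stand-in; not patsy's actual DesignInfo."""

    _STATE_VERSION = 2

    def __init__(self, column_names, formula):
        self.column_names = list(column_names)
        self.formula = formula

    def __getstate__(self):
        # Store an explicit version number next to the data.
        return {
            "version": self._STATE_VERSION,
            "column_names": self.column_names,
            "formula": self.formula,
        }

    def __setstate__(self, state):
        version = state.get("version")
        if version == 1:
            # Pretend version 1 called the field "columns"; upgrade it.
            state = dict(state, column_names=state["columns"])
        elif version != 2:
            raise ValueError(f"unsupported state version: {version!r}")
        self.column_names = list(state["column_names"])
        self.formula = state["formula"]


# Round trip with the current state layout.
info = ToyDesignInfo(["Intercept", "x"], "y ~ x")
restored = pickle.loads(pickle.dumps(info))

# Simulate loading state that an older release would have written.
old = ToyDesignInfo.__new__(ToyDesignInfo)
old.__setstate__({"version": 1, "columns": ["Intercept", "x"], "formula": "y ~ x"})
```

Tests for backward compatibility can then feed stored old-version payloads through `__setstate__` and check the result, which is the kind of check the testing framework mentioned above is meant to make easy.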

chrish42 avatar Jun 09 '16 02:06 chrish42

pleaseee fix this issue for the love of god !

NotImplementedError: Sorry, pickling not yet supported. See https://github.com/pydata/patsy/issues/26 if you want to help.
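For context, an error like this is what you get when a class deliberately refuses to be pickled. The sketch below shows one common way to do that; the names `no_pickling` and `ToyTerm` are illustrative and are not claimed to match patsy's internals exactly.

```python
import pickle


def no_pickling(*args, **kwargs):
    # Fail loudly instead of producing a pickle that may break later.
    raise NotImplementedError("Sorry, pickling not yet supported.")


class ToyTerm:
    """Hypothetical class that opts out of pickling."""

    __getstate__ = no_pickling


try:
    pickle.dumps(ToyTerm())
    message = None
except NotImplementedError as exc:
    message = str(exc)

print(message)
```

Because the error is raised from `__getstate__`, it appears for any container that holds such an object, not only when the object is pickled directly.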

elexira avatar Dec 07 '16 01:12 elexira

+1 Also need to pickle

datascientette avatar Jan 23 '17 18:01 datascientette

+1

For anyone interested, I've made a fork of patsy and merged this branch into the fork. I needed to make changes to the test cases so they worked properly for me, but this is fully working for me and we are using it in production.

I'm continuing to follow this thread so that we can switch back to using the patsy main repository once the finalized solution gets merged in.

christang avatar Nov 20 '17 14:11 christang

@christang I tried your branch, but it does not work for me. Can you confirm that this is supposed to work:

import pickle
from patsy import ModelDesc, Term, LookupFactor

response_term = [Term([LookupFactor('test')])]
# Pickle files must be opened in binary mode.
with open('test_pickle.pkl', 'wb') as f:
    pickle.dump(response_term, f)

I still get the NotImplementedError

saroele avatar Mar 05 '18 15:03 saroele

@saroele Thanks for the note. I believe this branch only adds pickling support for the design matrix and design info, so your other objects may still lack it. I can confirm that your code does not work for me either.

christang avatar Mar 06 '18 11:03 christang

@christang, can you help me out with this, please? I'm currently getting a NotImplementedError even though I believe I'm doing the pickling right. Any advice would be greatly appreciated.

y, X = dmatrices(formula_like, df_model, return_type="dataframe")

with open(models_path + filename, 'wb') as file:
    pickle.dump(X.design_info, file)

ciberger avatar Sep 19 '18 16:09 ciberger

This is really needed, guys. What's blocking this implementation?

bertomartin avatar May 08 '19 21:05 bertomartin

@bertomartin Patsy maintenance is done on a purely-volunteer basis, and I haven't really had time to work on it (or even review PRs) in several years now. If someone needs this and has funding to spend on it, we could talk about some kind of consulting contract...

njsmith avatar May 08 '19 23:05 njsmith

@njsmith thanks for the great work so far. Ok, I'll take a stab at it.

bertomartin avatar May 10 '19 15:05 bertomartin

@bertomartin any news?

insperatum avatar Jul 11 '20 23:07 insperatum

I have also tried looking into it:

import h5py

def save_patsy(patsy_step, filename):
    """Save the design info of a patsy pipeline step into a .h5 file."""
    with h5py.File(filename, 'w') as hf:
        hf.create_dataset("design_info", data=patsy_step.design_info_)

def load_coefficients(patsy_step, filename):
    """Attach the saved design info to a patsy pipeline step."""
    with h5py.File(filename, 'r') as hf:
        design_info = hf['design_info'][:]
    patsy_step.design_info_ = design_info


save_patsy(pipe['patsy'], "clf.h5")

Perhaps something simple like this?

However, this is still not working.

petrhrobar avatar Apr 24 '22 19:04 petrhrobar

Hi @petrhrobar. I recommend you check out formulaic if you want support for pickling.

matthewwardrop avatar Apr 27 '22 03:04 matthewwardrop

Here is a partial solution.

Before you first import patsy:

def fixed_factorinfo_repr(self, p, cycle):
    # Replacement for FactorInfo._repr_pretty_ that prints self.state.
    assert not cycle
    kwlist = [
        ("factor", self.factor),
        ("type", self.type),
        ("state", self.state),
    ]
    if self.type == "numerical":
        kwlist.append(("num_columns", self.num_columns))
    else:
        kwlist.append(("categories", self.categories))
    patsy.util.repr_pretty_impl(p, self, [], kwlist)

def fake_evalenvironment_repr(self):
    # Drops the captured namespaces; the repr always shows an empty environment.
    return "EvalEnvironment([])"

import patsy
patsy.FactorInfo._repr_pretty_ = fixed_factorinfo_repr
patsy.EvalEnvironment.__repr__ = fake_evalenvironment_repr

Then, to "serialize" a DesignInfo, do:

serialized_design_info = repr(my_design_info)

To "deserialize" a DesignInfo, do:

from collections import OrderedDict
from patsy import (DesignInfo, EvalEnvironment, EvalFactor, FactorInfo,
                   Term, SubtermInfo, ContrastMatrix)
from numpy import array
# eval() needs every class named in the repr string to be in scope.
design_info_instance = eval(serialized_design_info)

To get a design matrix from a design info, you can use the function patsy.build_design_matrices([design_info], X, return_type = "matrix")

Depending on your usage of patsy, there may be other __repr__ methods you need to monkey-patch like I did here. Obviously this implementation of __repr__ for EvalEnvironment will be insufficient if your formulas reference variables from the surrounding Python environment. None of this works if patsy gets imported before the monkey-patching happens. Buyer beware.

The author of this library deserves major credit for almost completely implementing __repr__, which is what made this workaround possible.
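The general idea behind this workaround, round-tripping an object through `repr()` and `eval()`, can be sketched without patsy at all. The `ToyFactor` class below is hypothetical and exists only to show the requirement: every class that appears in the repr string must produce a valid constructor call and must be in scope when `eval()` runs.

```python
from collections import OrderedDict


class ToyFactor:
    """Hypothetical class whose repr is a valid constructor call."""

    def __init__(self, code):
        self.code = code

    def __repr__(self):
        return f"ToyFactor({self.code!r})"

    def __eq__(self, other):
        return isinstance(other, ToyFactor) and self.code == other.code


original = OrderedDict(
    [("x", ToyFactor("center(x)")), ("z", ToyFactor("np.log(z)"))]
)
serialized = repr(original)

# eval() executes arbitrary code, so only use it on strings you produced yourself.
namespace = {"OrderedDict": OrderedDict, "ToyFactor": ToyFactor}
restored = eval(serialized, namespace)
```

Leaving a name out of `namespace` makes `eval()` fail with a `NameError`, which is why the imports in the deserialization step above matter.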

kyle-pena-nlp avatar May 11 '23 17:05 kyle-pena-nlp