patsy
patsy copied to clipboard
Patsy pickling
Branch for the rest of the work required to make patsy pickles work well. At a high level, we need to implement __getstate__
and __setstate__
for relevant classes (easy), and add a testing framework (that is easy to use) that makes sure that pickles created with one version of patsy continue to work with newer versions (harder).
Pickling testsuite:
- [ ] Create dict of pickling tests, with the key being the test name, and the value being an object of the above list, in a given configuration.
- [ ] Create a test helper that iterates through the above, saving a pickle for each in a file
pickles/vX.y/test_name.pickle
(where "test_name" is the key in the dict for the corresponding object). This program would be run before every release (with the version number of the release as an argument), to create a new set of pickle files for that release. Said pickle files would be added to the source repository. - [ ] Create an adapter that makes nosetests iterate through all these objects, and check that they compare equal to all the versions of their corresponding pickles.
Pickle support:
- [ ] Implement
__getstate__
and__setstate__
methods forEvalFactor
- [ ] Implement
__getstate__
and__setstate__
methods forLookupFactor
- [ ] Implement
__getstate__
and__setstate__
methods forTerm
- [ ] Implement
__getstate__
and__setstate__
methods forEvalEnvironment
- [ ] Implement
__getstate__
and__setstate__
methods for the stateful transform classes - [ ] Implement
__getstate__
and__setstate__
methods forContrastMatrix
- [ ] Implement
__getstate__
and__setstate__
methods forDesignInfo
- [ ] Implement
__getstate__
and__setstate__
methods forDesignMatrix
Coverage increased (+0.0%) to 97.87% when pulling c9e91fc6d5aa59591dc3ef68a3441cd82c53cd3d on chrish42:patsy-pickles into 3819ae9ec5577891831ba60fc409576bd74f5896 on pydata:master.
Hey sorry, I'll get to this soon but I'll need to spend some time thinking hard about the current organization of build.py
first, so it might be a few more days yet.
Any news?
Right, thanks for the ping :-).
I spent a few hours looking at this last night, and it's complicated!
The easy part: there are a bunch of simpler, lower-level objects that model pickling will rely on. For the most part I think we just need to add some tests for these (pickle/unpickle, and ideally having some historical pickles stored in the test suite to test that new versions of patsy continue to be able to unpickle them):
-
EvalFactor
-
LookupFactor
-
Term
(and in particular making sure its.name()
method continues to work) -
EvalEnvironment
- all the stateful transform classes (in
state.py
,splines.py
,mgcv_cubic_splines.py
) -- there's some shared testing code for these intest_state.py:check_stateful
that could take care of a lot of the work -
ContrastMatrix
Then there's the core design stuff, mostly in build.py
, which currently includes:
- the two
FactorEvaluator
classes -
_ColumnBuilder
-
DesignMatrixBuilder
This is the hard part, because the internal representations here are a mess -- weird encoding of information, redundant encodings across different objects, and so on. This really needs some substantive refactoring.
And then there's a few high-level objects that need some attention:
-
DesignInfo
-- a little complicated because it has a few different possible states -- sometimes only partial metadata is available -- but I think probably fine except for tests -
DesignMatrix
-- currently inheritsndarray.__reduce__
, which causes it to lose the crucial.design_info
attribute when going through a pickle/unpickle cycle, so that needs fixing
So I'm going to keep thinking about the best way to handle build.py
, but at least this should give some picture of the landscape...
It might also be worth doing a quick pass through the source to find any other classes that aren't on the above list, and add def __getstate__(self): raise NotImplementedError
methods just to be safe...
Okay, I sat down and refactored big chunks of build.py
, so now this should be easy :-). Basically I added a bunch more public metadata to DesignInfo
, which is useful in its own right (#61), and then I redid stuff in build.py
so that it can build matrices without ever using any non-public information. Which makes this much simpler.
I also went through and added raise-an-error __getstate__
methods to everything.
So now I think the todo list is just:
- Go through all the classes mentioned above one at a time, removing the
__getstate__
method and adding pickle/unpickle tests. - That's it.
That's great! I don't have time to look at this right now, but I should have more time around the next weekend.
Cool, no worries.
If I get some time I might go ahead and release 0.4.0, so people can at least start taking advantage of the new stuff in master. But that's no big deal, it would just mean the pickle stuff comes out in 0.5.0.
On Sun, Jun 14, 2015 at 5:03 PM, Christian Hudon [email protected] wrote:
That's great! I don't have time to look at this right now, but I should have more time around the next weekend.
— Reply to this email directly or view it on GitHub https://github.com/pydata/patsy/pull/67#issuecomment-111888565.
Nathaniel J. Smith -- http://vorpus.org
Working on this. Quick questions. Should I do a __getstate__
implementation for everything, or is there some stuff that doesn't need to be pickled and can stay with __getstate__ = no_pickling
? Also, what do you think would be good tests for this?
I guess we need a __getstate__
(or __reduce__
or whatever) for everything that's transitively referenced from DesignInfo
or DesignMatrix
. From the analysis in my comment above, I think that means: EvalFactor
, LookupFactor
, Term
, EvalEnvironment
, the stateful transform classes, ContrastMatrix
, DesignInfo
, DesignMatrix
.
For tests, I'd generally suggest round-trip tests, plus some tests that involve stored pickle files (or strings inside the test .py, or whatever) that we attempt to load (since this is the only way to make sure we haven't accidentally broken both the save and load pathways in parallel, which is surprisingly easy to do with pickle).
Getting back to working on patsy. My plan is to write the pickling support and tests for a simple class (I picked EvalFactor
) and show you that. Once we're happy with the way things are done for that class, I'll do the same approach for all the other classes.
So, while I'm in EvalFactor
, I have a quick question. What's the self.code
instance variable? It's not in the documentation of the class, and it also isn't used anywhere in said class. Do we still want to restore that on pickling?
Also, I assume we don't care about forward compatibility (i.e. it's okay if an older version of patsy can't read a pickle generated by a newer version), correct?
And for the tests, here's actually what I'm thinking of doing right now. You can tell me what you think, and if it sounds realistic.
We want to achieve two main things with the tests for pickling: 1) testing that pickling round-trips, and 2) testing that pickles created by old versions of patsy still work on newer version. I'd like not to repeat myself too much between these two, so here's what I'm thinking. I create a bunch of interesting patsy objects, all contained in (say) a dict. Then for each of these objects, I can test that obj == pickle.loads(pickle.dumps(obj))
. I can also easily create a script that pickles all of these objects to disk, under a directory named after the version of patsy. That script would be run before each release. Then I can iterate over the directories for all the versions, and test that for each object, obj == pickle.loads(the_pickle_for_that_object_for_a_given_version)
.
Do you think this would work as a testing strategy? Or am I missing something?
EvalFactor.code
looks used to me?
Forward compatibility: It'd be good if trying to load a new pickle in an old version at least gave an error? But we can always deal with that later at the time we decide to change something in a breaking fashion. (I guess if one wants to get really fancy one could put a version number field in every pickle so that old versions will not just error out, but also be able to display a comprehensible error message - "This version of patsy is too old..." -- but this is definitely an extra credit kind of thing.)
The testing strategy also looks good to me. I guess the one subtlety will be that if we add new stuff in a future release then we'll need to have some way to add new test objects that are only present in newer pickle dumps, while still keeping the old pickle dumps.
One possible approach: represent the test objects as little bits of constructor code -- "EvalFactor(...)"
, etc. -- and then have the version snapshot script dump both the pickles themselves and also the code used to create the objects that they should be compared to. (And then there can also be a test that loops over the same list for round-trips: for code in TEST_OBJECT_CODES: obj = eval(code); assert pickle.loads(pickle.dumps(obj)) == obj
)
Hi! I just came across this PR via this issue: https://github.com/pydata/patsy/issues/26
Is this still the PR that would solve that issue? Is there a way that I can help here?
What's happened so far has happened here, yes :-). IIRC the general plan for how to handle this is clear, but stalled out with the work mostly not done. If you and @chrish42 want to work on it together, or if @chrish42 is busy and you want to just pick up his work and continue where he left off, then either way it would be great, and I'm happy to continue providing advice and review.
I've updated the description at the top to actually talk about the work that remains to be done, after going through all the comments in the pull request and other sources. @njsmith, let me know if that approach makes at least some sense or not. :-) After thinking about it some more, I still think that giving each "pickle test" its own name and corresponding object in a Python file is a better approach overall than saving the code as text inside a pickle. This mirrors other tests more (they all have a name too). We can discuss this approach more once I've added code to show how it would work in practice. (Will be easier to talk about it then.)
@chrish42: top level comment looks reasonable, and if you have a better idea I am extremely happy to defer to it :-)
This is not quite done or working yet, but it's starting to show the general approach.
@njsmith Now would be good time to have a look and give any comments about the general approach. If things look okay, we can start making more of the patsy objects pickeable, and add more pickling testcases. Comments and questions welcome!
I made a bunch of small comments, but the overall approach looks fine to me! Thanks for working on this, and sorry for being slow in reviewing.