patsy icon indicating copy to clipboard operation
patsy copied to clipboard

Patsy pickling

Open chrish42 opened this issue 9 years ago • 20 comments

Branch for the rest of the work required to make patsy pickles work well. At a high level, we need to implement __getstate__ and __setstate__ for relevant classes (easy), and add a testing framework (that is easy to use) that makes sure that pickles created with one version of patsy continue to work with newer versions (harder).

Pickling testsuite:

  • [ ] Create dict of pickling tests, with the key being the test name, and the value being an object of the above list, in a given configuration.
  • [ ] Create a test helper that iterates through the above, saving a pickle for each in a file pickles/vX.y/test_name.pickle (where "test_name" is the key in the dict for the corresponding object). This program would be run before every release (with the version number of the release as an argument), to create a new set of pickle files for that release. Said pickle files would be added to the source repository.
  • [ ] Create an adapter that makes nosetests iterate through all these objects, and check that they compare equal to all the versions of their corresponding pickles.

Pickle support:

  • [ ] Implement __getstate__ and __setstate__ methods for EvalFactor
  • [ ] Implement __getstate__ and __setstate__ methods for LookupFactor
  • [ ] Implement __getstate__ and __setstate__ methods for Term
  • [ ] Implement __getstate__ and __setstate__ methods for EvalEnvironment
  • [ ] Implement __getstate__ and __setstate__ methods for the stateful transform classes
  • [ ] Implement __getstate__ and __setstate__ methods for ContrastMatrix
  • [ ] Implement __getstate__ and __setstate__ methods for DesignInfo
  • [ ] Implement __getstate__ and __setstate__ methods for DesignMatrix

chrish42 avatar May 21 '15 02:05 chrish42

Coverage Status

Coverage increased (+0.0%) to 97.87% when pulling c9e91fc6d5aa59591dc3ef68a3441cd82c53cd3d on chrish42:patsy-pickles into 3819ae9ec5577891831ba60fc409576bd74f5896 on pydata:master.

coveralls avatar May 21 '15 07:05 coveralls

Hey sorry, I'll get to this soon but I'll need to spend some time thinking hard about the current organization of build.py first, so it might be a few more days yet.

njsmith avatar May 25 '15 05:05 njsmith

Any news?

chrish42 avatar Jun 06 '15 21:06 chrish42

Right, thanks for the ping :-).

I spent a few hours looking at this last night, and it's complicated!

The easy part: there are a bunch of simpler, lower-level objects that model pickling will rely on. For the most part I think we just need to add some tests for these (pickle/unpickle, and ideally having some historical pickles stored in the test suite to test that new versions of patsy continue to be able to unpickle them):

  • EvalFactor
  • LookupFactor
  • Term (and in particular making sure its .name() method continues to work)
  • EvalEnvironment
  • all the stateful transform classes (in state.py, splines.py, mgcv_cubic_splines.py) -- there's some shared testing code for these in test_state.py:check_stateful that could take care of a lot of the work
  • ContrastMatrix

Then there's the core design stuff, mostly in build.py, which currently includes:

  • the two FactorEvaluator classes
  • _ColumnBuilder
  • DesignMatrixBuilder

This is the hard part, because the internal representations here are a mess -- weird encoding of information, redundant encodings across different objects, and so on. This really needs some substantive refactoring.

And then there's a few high-level objects that need some attention:

  • DesignInfo -- a little complicated because it has a few different possible states -- sometimes only partial metadata is available -- but I think probably fine except for tests
  • DesignMatrix -- currently inherits ndarray.__reduce__, which causes it to lose the crucial .design_info attribute when going through a pickle/unpickle cycle, so that needs fixing

So I'm going to keep thinking about the best way to handle build.py, but at least this should give some picture of the landscape...

njsmith avatar Jun 09 '15 01:06 njsmith

It might also be worth doing a quick pass through the source to find any other classes that aren't on the above list, and add def __getstate__(self): raise NotImplementedError methods just to be safe...

njsmith avatar Jun 09 '15 01:06 njsmith

Okay, I sat down and refactored big chunks of build.py, so now this should be easy :-). Basically I added a bunch more public metadata to DesignInfo, which is useful in its own right (#61), and then I redid stuff in build.py so that it can build matrices without ever using any non-public information. Which makes this much simpler.

I also went through and added raise-an-error __getstate__ methods to everything.

So now I think the todo list is just:

  • Go through all the classes mentioned above one at a time, removing the __getstate__ method and adding pickle/unpickle tests.
  • That's it.

njsmith avatar Jun 14 '15 02:06 njsmith

That's great! I don't have time to look at this right now, but I should have more time around the next weekend.

chrish42 avatar Jun 15 '15 00:06 chrish42

Cool, no worries.

If I get some time I might go ahead and release 0.4.0, so people can at least start taking advantage of the new stuff in master. But that's no big deal, it would just mean the pickle stuff comes out in 0.5.0.

On Sun, Jun 14, 2015 at 5:03 PM, Christian Hudon [email protected] wrote:

That's great! I don't have time to look at this right now, but I should have more time around the next weekend.

— Reply to this email directly or view it on GitHub https://github.com/pydata/patsy/pull/67#issuecomment-111888565.

Nathaniel J. Smith -- http://vorpus.org

njsmith avatar Jun 15 '15 08:06 njsmith

Working on this. Quick questions. Should I do a __getstate__ implementation for everything, or is there some stuff that doesn't need to be pickled and can stay with __getstate__ = no_pickling? Also, what do you think would be good tests for this?

chrish42 avatar Jun 21 '15 02:06 chrish42

I guess we need a __getstate__ (or __reduce__ or whatever) for everything that's transitively referenced from DesignInfo or DesignMatrix. From the analysis in my comment above, I think that means: EvalFactor, LookupFactor, Term, EvalEnvironment, the stateful transform classes, ContrastMatrix, DesignInfo, DesignMatrix.

For tests, I'd generally suggest round-trip tests, plus some tests that involve stored pickle files (or strings inside the test .py, or whatever) that we attempt to load (since this is the only way to make sure we haven't accidentally broken both the save and load pathways in parallel, which is surprisingly easy to do with pickle).

njsmith avatar Jun 24 '15 01:06 njsmith

Getting back to working on patsy. My plan is to write the pickling support and tests for a simple class (I picked EvalFactor) and show you that. Once we're happy with the way things are done for that class, I'll do the same approach for all the other classes.

So, while I'm in EvalFactor, I have a quick question. What's the self.code instance variable? It's not in the documentation of the class, and it also isn't used anywhere in said class. Do we still want to restore that on pickling?

Also, I assume we don't care about forward compatibility (i.e. it's okay if an older version of patsy can't read a pickle generated by a newer version), correct?

chrish42 avatar Jul 10 '15 01:07 chrish42

And for the tests, here's actually what I'm thinking of doing right now. You can tell me what you think, and if it sounds realistic.

We want to achieve two main things with the tests for pickling: 1) testing that pickling round-trips, and 2) testing that pickles created by old versions of patsy still work on newer version. I'd like not to repeat myself too much between these two, so here's what I'm thinking. I create a bunch of interesting patsy objects, all contained in (say) a dict. Then for each of these objects, I can test that obj == pickle.loads(pickle.dumps(obj)). I can also easily create a script that pickles all of these objects to disk, under a directory named after the version of patsy. That script would be run before each release. Then I can iterate over the directories for all the versions, and test that for each object, obj == pickle.loads(the_pickle_for_that_object_for_a_given_version).

Do you think this would work as a testing strategy? Or am I missing something?

chrish42 avatar Jul 10 '15 01:07 chrish42

EvalFactor.code looks used to me?

Forward compatibility: It'd be good if trying to load a new pickle in an old version at least gave an error? But we can always deal with that later at the time we decide to change something in a breaking fashion. (I guess if one wants to get really fancy one could put a version number field in every pickle so that old versions will not just error out, but also be able to display a comprehensible error message - "This version of patsy is too old..." -- but this is definitely an extra credit kind of thing.)

The testing strategy also looks good to me. I guess the one subtlety will be that if we add new stuff in a future release then we'll need to have some way to add new test objects that are only present in newer pickle dumps, while still keeping the old pickle dumps.

One possible approach: represent the test objects as little bits of constructor code -- "EvalFactor(...)", etc. -- and then have the version snapshot script dump both the pickles themselves and also the code used to create the objects that they should be compared to. (And then there can also be a test that loops over the same list for round-trips: for code in TEST_OBJECT_CODES: obj = eval(code); assert pickle.loads(pickle.dumps(obj)) == obj)

njsmith avatar Jul 17 '15 03:07 njsmith

Hi! I just came across this PR via this issue: https://github.com/pydata/patsy/issues/26

Is this still the PR that would solve that issue? Is there a way that I can help here?

louispotok avatar Apr 30 '16 19:04 louispotok

What's happened so far has happened here, yes :-). IIRC the general plan for how to handle this is clear, but stalled out with the work mostly not done. If you and @chrish42 want to work on it together, or if @chrish42 is busy and you want to just pick up his work and continue where he left off, then either way it would be great, and I'm happy to continue providing advice and review.

njsmith avatar Apr 30 '16 20:04 njsmith

I've updated the description at the top to actually talk about the work that remains to be done, after going through all the comments in the pull request and other sources. @njsmith, let me know if that approach makes at least some sense or not. :-) After thinking about it some more, I still think that giving each "pickle test" its own name and corresponding object in a Python file is a better approach overall than saving the code as text inside a pickle. This mirrors other tests more (they all have a name too). We can discuss this approach more once I've added code to show how it would work in practice. (Will be easier to talk about it then.)

chrish42 avatar Jun 04 '16 00:06 chrish42

@chrish42: top level comment looks reasonable, and if you have a better idea I am extremely happy to defer to it :-)

njsmith avatar Jun 04 '16 02:06 njsmith

This is not quite done or working yet, but it's starting to show the general approach.

chrish42 avatar Jun 04 '16 23:06 chrish42

@njsmith Now would be good time to have a look and give any comments about the general approach. If things look okay, we can start making more of the patsy objects pickeable, and add more pickling testcases. Comments and questions welcome!

chrish42 avatar Jun 06 '16 00:06 chrish42

I made a bunch of small comments, but the overall approach looks fine to me! Thanks for working on this, and sorry for being slow in reviewing.

njsmith avatar Jun 10 '16 07:06 njsmith