
ENH: Generate var_names from the data and partial predict

thequackdaddy opened this issue 7 years ago • 5 comments

Hello,

I have a proposal that really came about because of the way I've been interacting with patsy.

My datasets are fairly long and fairly wide. I have lots of fields that I use for exploration, but naturally most of them don't end up in any one model.

I've been using bcolz because it stores the data in a columnar fashion, which makes reading just a subset of columns really easy. Before, I'd been creating a list of the variables I wanted, defining all the transforms I needed in patsy, and then feeding that through. I can't load the entire dataset into memory because it's too wide and long, and I might only be looking at 20-30 columns for any one model.

So I propose having patsy attempt to figure out which columns it needs from the data using this new var_names method which is available on DesignInfo, EvalFactor, and Term. In a nutshell, it gets a list of all the variables used, checks if that variable is defined in the EvalEnvironment, and if not, assumes it must be data.
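
To make the idea concrete, here's roughly what I have in mind (var_names() is the method proposed in this PR, so the name and exact return type are up for discussion):

import numpy as np
import pandas as pd
from patsy import dmatrix

data = pd.DataFrame({'x': [1.0, 2.0, 3.0], 'g': ['a', 'b', 'a']})
dm = dmatrix('g + np.log(x)', data)

# 'np' is defined in the EvalEnvironment, so it is skipped; 'g' and 'x' are
# not, so they are assumed to be data columns.
print(dm.design_info.var_names())   # e.g. {'g', 'x'}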

I've called this var_names for now, but perhaps non_eval_var_names would be more accurate? Open to suggestions here.

One nice thing is that when using incr_dbuilder, it can automatically slice out just the columns it needs, which makes the construction much faster (for me at least).
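
For example (ModelDesc and incr_dbuilder are existing patsy API; var_names() is the method proposed here, and read_chunk() is a stand-in for whatever columnar reader, bcolz in my case, actually provides):

import numpy as np
from patsy import ModelDesc, incr_dbuilder

formula = 'np.log(integer) + categorical'
desc = ModelDesc.from_formula(formula)

# Collect only the data-ish names the formula needs.
needed = set()
for term in desc.rhs_termlist:
    needed |= term.var_names()

def data_chunks():
    # Yield dict-like chunks that contain only the needed columns.
    for chunk in read_chunk(columns=sorted(needed)):
        yield chunk

design_info = incr_dbuilder(formula, data_chunks)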

Here's a gist demo'ing this.

https://gist.github.com/thequackdaddy/2e601afff4fbbfe42ed31a9b2925967d

Let me know what you think.

thequackdaddy avatar Dec 29 '16 22:12 thequackdaddy

Codecov Report

Merging #98 into master will increase coverage by 0.03%. The diff coverage is 100%.


@@            Coverage Diff             @@
##           master      #98      +/-   ##
==========================================
+ Coverage   98.96%   98.99%   +0.03%     
==========================================
  Files          30       30              
  Lines        5585     5760     +175     
  Branches      775      803      +28     
==========================================
+ Hits         5527     5702     +175     
  Misses         35       35              
  Partials       23       23
Impacted Files         Coverage Δ
patsy/user_util.py     100% <100%> (ø)
patsy/test_build.py    98.1% <100%> (+0.1%)
patsy/desc.py          98.42% <100%> (+0.07%)
patsy/design_info.py   99.68% <100%> (+0.06%)
patsy/build.py         99.62% <100%> (ø)
patsy/eval.py          99.16% <100%> (+0.04%)


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 4c613d0...544effd.

codecov-io avatar Dec 29 '16 22:12 codecov-io

I went ahead and built the partial function that I had alluded to in #93. This makes it much easier to create design matrices for statsmodels that show you the marginal differences when you only change the levels of one (or more) factors.

Here's a basic example:

In [1]: from patsy import dmatrix
   ...: import pandas as pd
   ...: import numpy as np
   ...:
   ...: data = pd.DataFrame({'categorical': ['a', 'b', 'c', 'b', 'a'],
   ...:                      'integer': [1, 3, 7, 2, 1],
   ...:                      'flt': [1.5, 0.0, 3.2, 4.2, 0.7]})
   ...: dm = dmatrix('categorical * np.log(integer) + bs(flt, df=3, degree=3)',
   ...:  data)
   ...: dm.design_info.partial({'categorical': ['a', 'b', 'c']})
   ...:
Out[1]:
array([[ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.]])

In [2]: dm.design_info.partial({'categorical': ['a', 'b'],
   ...:                         'integer': [1, 2, 3, 4]},
   ...:                        product=True)
Out[2]:
array([[ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.69314718,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  1.09861229,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  1.38629436,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  1.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  1.        ,  0.        ,  0.69314718,  0.69314718,
         0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  1.        ,  0.        ,  1.09861229,  1.09861229,
         0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  1.        ,  0.        ,  1.38629436,  1.38629436,
         0.        ,  0.        ,  0.        ,  0.        ]])

thequackdaddy avatar Mar 04 '17 23:03 thequackdaddy

@njsmith Also, it appears that Travis isn't kicking off for this all of a sudden. Any ideas why that would be?

I'm fairly certain this will pass. Here is the branch in my Travis.

thequackdaddy avatar Mar 04 '17 23:03 thequackdaddy

It seems like it would be simpler to query a ModelDesc for all the variables it uses, period? And then it'd be your job to ignore the ones that aren't present in your data set. This would also be more accurate, because...

The even simpler option (from the API design / coupling perspective) would be to pass in a dict-like that loads the data lazily on demand, like:

class LazyData(dict):
    def __missing__(self, key):
        try:
            return bcolz.load(key, file)
        except BcolzKeyNotFound:
            raise KeyError(key)
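
Usage would then look something like this (sketch only; bcolz.load and BcolzKeyNotFound above are placeholders for whatever your storage layer actually exposes):

import numpy as np
from patsy import dmatrix

lazy = LazyData()
# patsy only asks for the names the formula references, so only 'x1' and 'x2'
# hit __missing__ and get read off disk; 'np' raises KeyError, so patsy falls
# back to looking it up in the calling environment.
dm = dmatrix('x1 + np.log(x2)', lazy)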

Would this work for you?

Is the partial part somehow tied to the var_names part? They look like separate changes to me, so should be in separate PRs?

This is also missing lots of tests, but let's not worry about that until after the high-level discussion...

njsmith avatar Mar 05 '17 20:03 njsmith

It seems like it would be simpler to query a ModelDesc for all the variables it uses, period? And then it'd be your job to ignore the ones that aren't present in your data set. This would also be more accurate, because...

Hmm... I hadn't thought of that. That should be relatively easy to add/change based on what I've done so far. The heart of this is the var_names method on the EvalFactor class, which uses the ast_names function to find all the objects needed to evaluate the factor. This is in turn used by the Term class (and that, in turn, by the DesignInfo class). ModelDesc has lists of terms (lhs_termlist and rhs_termlist), so adding this would be easy.

I presume you're implying that I shouldn't be worrying about the EvalEnvironment variables and should just return every dependent object, function and module alike? I was trying to return only "data"-ish things. Simply removing them from the output set manually seems easy enough...
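
Concretely, the filtering I'm doing looks something like this (ast_names is the existing helper in patsy.eval; the env_vars filtering is the part that could be dropped per your suggestion):

from patsy.eval import ast_names

code = 'np.log(integer) + bs(flt, df=3, degree=3)'
used = set(ast_names(code))        # {'np', 'bs', 'integer', 'flt'}

env_vars = {'np', 'bs'}            # names that resolve in the EvalEnvironment
data_vars = used - env_vars        # {'integer', 'flt'}: assumed to be data
print(data_vars)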

The even simpler (from the API design / coupling perspective) would be to pass in a dict-like that loads the data lazily on demand, like:

This is really clever, thanks! I'll try it. However, I don't think it solves the partial issue below.

Is the partial part somehow tied to the var_names part? They look like separate changes to me, so should be in separate PRs?

Yes. partial looks at each Term's var_names and decides whether the Term needs the variable or not. If it does, it pulls that Term out using subset and builds the design matrix columns for just that Term from the variables specified. Otherwise, it returns columns full of zeros. The end result is a design matrix with the same width and column alignment as the model's DesignMatrix, but with only as many rows as needed to evaluate the partial predictions, and with zeros in all the other columns.
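
As a toy illustration of that layout (plain numpy, not the actual implementation), it amounts to:

import numpy as np

n_rows, n_cols = 3, 9            # rows needed for the partial predictions,
                                 # width of the full design matrix
term_cols = [1, 2]               # columns produced by the term being varied
term_block = np.array([[0.0, 0.0],
                       [1.0, 0.0],
                       [0.0, 1.0]])   # evaluated only for those columns

partial = np.zeros((n_rows, n_cols))
partial[:, term_cols] = term_block    # everything else stays zero
print(partial)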

This is also missing lots of tests, but let's not worry about that until after the high-level discussion...

Sounds good. Writing tests is not something I've excelled at. This is somewhat tested, and I think there is coverage for most of the new lines, though I likely missed a few. I added some asserts to some of the existing tests to cover the new functionality.

thequackdaddy avatar Mar 05 '17 21:03 thequackdaddy