patsy
ENH: Generate var_names from the data and partial predict
Hello,
I have a proposal that came about from the way I've been interacting with patsy.

My datasets are fairly long and fairly wide. I have lots of fields that I use for exploring, but most of them never end up in any given model.

I've been using bcolz because it stores the data in a columnar fashion, making slices over a subset of columns really easy. Previously, I'd create a list of the variables I wanted, define all the transforms I needed in patsy, and then feed that through. I can't load the entire dataset into memory because it's too wide and long, and I might only be looking at 20-30 columns for any one model.
So I propose having patsy attempt to figure out which columns it needs from the data using a new `var_names` method, available on `DesignInfo`, `EvalFactor`, and `Term`. In a nutshell, it gets a list of all the variables used, checks whether each variable is defined in the `EvalEnvironment`, and if not, assumes it must come from the data.

I've called this `var_names` for now, but arguably `non_eval_var_names` might be more accurate? Open to suggestions here.
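To make the "environment vs. data" classification concrete, here is a minimal stdlib-only sketch of the idea, using Python's `ast` module the way patsy's `ast_names` helper does. The `var_names` function here is my illustration of the concept, not the PR's exact implementation:

```python
import ast

def ast_names(code):
    """Collect the bare variable names referenced in an expression."""
    return {node.id for node in ast.walk(ast.parse(code, mode="eval"))
            if isinstance(node, ast.Name)}

def var_names(factor_code, env):
    """Names not resolvable in the environment are assumed to be data columns."""
    return {name for name in ast_names(factor_code) if name not in env}

# 'np' resolves in the environment, so only the data columns remain:
env = {"np": None, "bs": None}
print(sorted(var_names("categorical * np.log(integer)", env)))
# ['categorical', 'integer']
```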
One nice thing is that when using `incr_dbuilder`, it can automatically slice on the needed columns, which makes the construction much faster (for me at least).
Here's a gist demoing this:
https://gist.github.com/thequackdaddy/2e601afff4fbbfe42ed31a9b2925967d
Let me know what you think.
Codecov Report
Merging #98 into master will increase coverage by 0.03%. The diff coverage is 100%.
```diff
@@            Coverage Diff             @@
##           master      #98      +/-   ##
==========================================
+ Coverage   98.96%   98.99%   +0.03%
==========================================
  Files          30       30
  Lines        5585     5760     +175
  Branches      775      803      +28
==========================================
+ Hits         5527     5702     +175
  Misses         35       35
  Partials       23       23
```
| Impacted Files | Coverage Δ | |
|---|---|---|
| patsy/user_util.py | 100% <100%> (ø) | :arrow_up: |
| patsy/test_build.py | 98.1% <100%> (+0.1%) | :arrow_up: |
| patsy/desc.py | 98.42% <100%> (+0.07%) | :arrow_up: |
| patsy/design_info.py | 99.68% <100%> (+0.06%) | :arrow_up: |
| patsy/build.py | 99.62% <100%> (ø) | :arrow_up: |
| patsy/eval.py | 99.16% <100%> (+0.04%) | :arrow_up: |
Continue to review full report at Codecov.
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4c613d0...544effd. Read the comment docs.
I went ahead and built the `partial` function that I had alluded to in #93. This makes it much easier to create design matrices for statsmodels that show you the marginal differences when you only change the levels of one (or more) factors.
Here's a basic example:
```python
In [1]: from patsy import dmatrix
   ...: import pandas as pd
   ...: import numpy as np
   ...:
   ...: data = pd.DataFrame({'categorical': ['a', 'b', 'c', 'b', 'a'],
   ...:                      'integer': [1, 3, 7, 2, 1],
   ...:                      'flt': [1.5, 0.0, 3.2, 4.2, 0.7]})
   ...: dm = dmatrix('categorical * np.log(integer) + bs(flt, df=3, degree=3)',
   ...:              data)
   ...: dm.design_info.partial({'categorical': ['a', 'b', 'c']})
   ...:
Out[1]:
array([[ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.]])

In [2]: dm.design_info.partial({'categorical': ['a', 'b'],
   ...:                         'integer': [1, 2, 3, 4]},
   ...:                        product=True)
Out[2]:
array([[ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.69314718,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  1.09861229,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  1.38629436,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  1.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  1.        ,  0.        ,  0.69314718,  0.69314718,
         0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  1.        ,  0.        ,  1.09861229,  1.09861229,
         0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  1.        ,  0.        ,  1.38629436,  1.38629436,
         0.        ,  0.        ,  0.        ,  0.        ]])
```
@njsmith Also, it appears that Travis isn't kicking off for this all of a sudden. Any ideas why that would be? I'm fairly certain this will pass; here is the branch building in my Travis.
It seems like it would be simpler to query a `ModelDesc` for all the variables it uses, period? And then it'd be your job to ignore the ones that aren't present in your data set. This would also be more accurate, because
An even simpler option (from the API design / coupling perspective) would be to pass in a dict-like that loads the data lazily on demand, like:
```python
class LazyData(dict):
    def __missing__(self, key):
        try:
            # pseudocode: substitute your store's real lookup here
            return bcolz.load(key, file)
        except BcolzKeyNotFound:  # stand-in for the store's "not found" error
            raise KeyError(key)
```
Would this work for you?
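For what it's worth, the dict-like suggestion works through plain `dict.__missing__` semantics; here is a self-contained sketch with a stand-in loader (the `load_column` hook and the fake in-memory store are my illustration, not bcolz's API):

```python
class LazyData(dict):
    """Dict that loads columns on first access and caches them."""

    def __init__(self, load_column):
        super().__init__()
        self.load_column = load_column  # e.g. a closure over an on-disk store

    def __missing__(self, key):
        # Load, cache, and return; a KeyError from the loader propagates,
        # which is exactly what dict lookup protocol expects for a miss.
        value = self[key] = self.load_column(key)
        return value

# Fake columnar store standing in for bcolz:
store = {"integer": [1, 3, 7], "flt": [1.5, 0.0, 3.2]}
data = LazyData(lambda key: store[key])
print(data["integer"])  # loaded on demand -> [1, 3, 7]
print(len(data))        # only the touched column is resident -> 1
```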
Is the `partial` part somehow tied to the `var_names` part? They look like separate changes to me, so should they be in separate PRs?
This is also missing lots of tests, but let's not worry about that until after the high-level discussion...
> It seems like it would be simpler to query a ModelDesc for all the variables it uses, period? And then it'd be your job to ignore the ones that aren't present in your data set. This would also be more accurate, because
Hmm... I hadn't thought of that. That should be relatively easy to add/change based on what I've done so far. The heart of this is the `var_names` method on the `EvalFactor` class, which looks at all the objects needed to evaluate the factor using the `ast_names` function. This is used in turn by the `Term` class (and by the `DesignInfo` class in turn). `ModelDesc` has lists of terms (`lhs_termlist` and `rhs_termlist`), so adding this would be easy.

I presume you're implying that I shouldn't worry about the `EvalEnvironment` and should just return every dependent object, function and module alike? I was trying to return only "data"-ish things. Simply removing the environment names from the output set manually seems easy enough...
> The even simpler (from the API design / coupling perspective) would be to pass in a dict-like that loads the data lazily on demand, like:
This is really clever, thanks! I'll try it. However, I don't think it solves the `partial` issue below.
> Is the partial part somehow tied to the var_names part? They look like separate changes to me, so should be in separate PRs?
Yes. `partial` looks at each `Term`'s `var_names` and decides whether the `Term` needs the variable or not. If yes, it pulls that `Term` using `subset` to create the design matrix for just that subset of columns, using the variables specified. Otherwise, it returns columns full of zeros. The end result is a design matrix with the same width and column alignment as the model's `DesignMatrix`, but with only as many rows as needed to evaluate the partial predictions, and zeros in the remaining columns.
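As I understand it, the assembly step amounts to zero-filling a matrix with the model's full column layout and writing each evaluated term into its column slice. A rough numpy sketch of that idea (the names `term_slices` and `term_values` are mine, and this glosses over patsy's real `subset` machinery):

```python
import numpy as np

def partial_matrix(n_rows, n_cols, term_slices, term_values):
    """Zero matrix with the model's full column layout; only terms whose
    variables were supplied get their evaluated columns filled in."""
    out = np.zeros((n_rows, n_cols))
    for term, columns in term_values.items():
        out[:, term_slices[term]] = columns
    return out

# Two of three columns belong to the supplied term; the third stays zero:
pm = partial_matrix(2, 3,
                    {"categorical": slice(0, 2)},
                    {"categorical": np.array([[1., 0.], [0., 1.]])})
print(pm)
# [[1. 0. 0.]
#  [0. 1. 0.]]
```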
> This is also missing lots of tests, but let's not worry about that until after the high-level discussion...
Sounds good. Writing tests is not something I've excelled at. This is somewhat tested, and I think there is coverage for most of the new lines, though I likely missed a few. I added some `assert`s to some of the existing tests to exercise the new functionality.