patsy
Categorical does not work with nan
I have a column whose unique values look like:
array([nan, 'CONFERENCE', 'ANALYST', 'FORUM', 'SEMINAR'], dtype=object)
I would expect that adding C(col_name) to the formula would create 4 dummy variables (5 values - 1), but in fact it only adds 3.
When I tried to explicitly set the control (reference) level to nan, I get an exception:
C(col_name, Treatment(reference=nan))
PatsyError: specified level nan not found
By default, patsy thinks that 'nan' indicates missing data, and drops those rows from your data rather than treating them like a 5th category. (If they really are missing then treating them like a 5th category is pretty statistically suspect, I think...) If this isn't what you want, then dmatrix and friends take an NA_action= argument, to which you can pass an NAAction object set up to tell patsy what you really want it to do: http://patsy.readthedocs.org/en/latest/API-reference.html#patsy.NAAction (Notice that by default NA_types includes "NaN" -- this is what's causing your problem.)
If you want to just disable missing value handling altogether, that can be accomplished with something like: dmatrix(..., NA_action=NAAction(NA_types=[]))
Does that help?
(Replying by email to Alex Rothberg's message of Tue, Mar 18, 2014, 5:34 PM.)
Nathaniel J. Smith Postdoctoral researcher - Informatics - University of Edinburgh http://vorpus.org
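As a rough illustration of the NA_action suggestion above, the sketch below checks how an NAAction object classifies a NaN under the default settings versus NA_types=[], and shows the default row-dropping on a toy column. The DataFrame and the column name `col` are made up for illustration; only the default path is run through dmatrix here.

```python
import numpy as np
import pandas as pd
from patsy import NAAction, dmatrix

# Toy data mirroring the column from the report (names are illustrative).
df = pd.DataFrame(
    {"col": pd.Series([np.nan, "CONFERENCE", "ANALYST", "FORUM", "SEMINAR"], dtype=object)}
)

default_action = NAAction()           # on_NA="drop", NA_types=["None", "NaN"]
no_na_action = NAAction(NA_types=[])  # nothing is treated as missing

# Under the defaults, NaN counts as a missing categorical value...
nan_is_missing_by_default = default_action.is_categorical_NA(np.nan)
# ...and with an empty NA_types it does not.
nan_is_missing_when_disabled = no_na_action.is_categorical_NA(np.nan)

# Default behaviour: the NaN row is dropped, leaving 4 levels,
# i.e. an intercept plus 3 treatment-coded dummy columns.
default_matrix = dmatrix("C(col)", df)
default_shape = default_matrix.shape
```

Passing NA_action=no_na_action to dmatrix corresponds to the call suggested in the comment above.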
Currently I am using patsy through statsmodels:
from statsmodels.formula.api import ols
model = ols("y ~ x...", data)
so how would I make changes to nan handling?
Also the rows with nan in them are definitely not being dropped.
I don't know -- I just tested what I said against patsy itself, and:
- by default it did in fact both ignore the nan when deciding how many levels there were, and then dropped that row when building the design matrix
- and if I set NA_action like I said, then it did include the nan when deciding how many levels there were, and did include it correctly in the design matrix.
So I guess it's a bug in how statsmodels is calling patsy...?
@jseabold @josef-pkt
(Replying by email to Alex Rothberg's message of Tue, Mar 18, 2014, 5:54 PM.)
Related I guess https://github.com/statsmodels/statsmodels/issues/805
I haven't looked at this in a while, and we didn't coordinate well on this in the beginning. We tried to keep missing data handling mostly on our side because we have more than y/X to deal with.
...and patsy didn't have any missing data handling when I wrote that.
Brainstorming:
As a workaround if you want to be in charge of missing data handling you could just always disable patsy's. But this might make it tricky to handle categorical variables and stateful transforms right...
Ideal solution might be to move all NA handling into patsy, but to do that we'd need to add a way to pass parallel vectors through patsy (parallel = parallel to y/X, things like weights).
If you don't care about eliminating NA values in weights, then you could let patsy do the missing value handling and then peek at the index on the returned dataframe to see which rows got eliminated, and throw those out of the other vectors. I remember I ran into some problem trying to do this though in my own code and ended up with a hack instead: https://github.com/rerpy/rerpy/blob/master/rerpy/rerp.py#L339 I don't remember what exactly the problem was, I could probably find some notes somewhere...
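A minimal sketch of the "peek at the index" idea, assuming a pandas DataFrame as input and a hypothetical `weights` Series that runs parallel to the rows. The names `df`, `col`, and `weights` are illustrative and are not taken from statsmodels or rerpy.

```python
import numpy as np
import pandas as pd
from patsy import dmatrix

df = pd.DataFrame(
    {"col": pd.Series([np.nan, "CONFERENCE", "ANALYST", "FORUM", "SEMINAR"], dtype=object)}
)
# Hypothetical parallel vector that patsy never sees.
weights = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], index=df.index)

# Let patsy drop rows with missing values; asking for a DataFrame keeps the
# original row labels of the surviving rows.
X = dmatrix("C(col)", df, return_type="dataframe")

# Rows that patsy eliminated, and the parallel vector aligned to what is left.
dropped_rows = df.index.difference(X.index)
aligned_weights = weights.loc[X.index]
```

As noted above, this does nothing about NA values that occur only in the parallel vector itself.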
Alex: Your best quick workaround might be to swap your nan values for a string, like "nan" or "--" or whatever that value actually means to you :-).
(Replying by email to Skipper Seabold's message of Tue, Mar 18, 2014, 6:11 PM.)
Yep, putting this in formula seems to work:
C(col_name.fillna("N/A"), Treatment(reference="N/A"))
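For reference, here is the same workaround run directly through patsy on a toy column. This is only a sketch: the DataFrame is made up to mirror the values in the report, and it calls dmatrix rather than going through statsmodels.

```python
import numpy as np
import pandas as pd
from patsy import dmatrix

# Toy data with the same five unique values as in the report.
df = pd.DataFrame(
    {"col_name": pd.Series([np.nan, "CONFERENCE", "ANALYST", "FORUM", "SEMINAR"], dtype=object)}
)

# Replace NaN with an ordinary string level and make it the reference level.
formula = 'C(col_name.fillna("N/A"), Treatment(reference="N/A"))'
X = dmatrix(formula, df, return_type="dataframe")

# No rows are dropped, and the four non-reference levels each get a dummy.
n_rows, n_cols = X.shape
dummy_columns = [name for name in X.columns if name != "Intercept"]
```

With "N/A" as the reference level, the remaining four values each get their own dummy column, which is the 5 - 1 count expected in the original report.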
If there is no explicit missing='drop' when creating the model, then statsmodels doesn't check for nans at all. The nan handling is all patsy in this case, and if there are no extra arrays then, I think, there are not (or should not be) any problems.
The model is initialized with whatever endog and exog patsy returns.
If my reading of the statsmodels source is correct: https://github.com/statsmodels/statsmodels/blob/master/statsmodels/formula/formulatools.py#L38 https://github.com/statsmodels/statsmodels/blob/master/statsmodels/base/model.py#L109
Is there a replicable example or test case for this?
There's one at the top of the issue I linked to above. Note that the current behavior on that issue is the opposite of what was causing the problem before and what is causing the issue here. The nan category is dropped in patsy by default now I guess, and we don't do anything to control this.
Yes, I understand mostly our problems with statsmodels 805, however, I think in this issue, patsy 36, the missing data handling of statsmodels is not involved at all. So this issue should be all patsy, even if the call goes through statsmodels.
maybe I'm late and cancan101's solution/workaround already made this clear.
See the second comment above. The issue from our end is that we don't pass any NA handling to patsy under the hood, so we don't have any way to suppress its dropping of NAs in the categoricals. So the issue with #805 is actually resolved, but it's because the defaults in patsy changed / missing data handling was added. We don't allow users to treat NaNs as a category right now. (I'm not convinced we should, though.)
Ok, I see, I didn't understand that part.
So the from_formula method needs to hand off some patsy_options to dmatrices? That might collide with whatever deterministic (not user-influenced) behavior we want to expect from patsy. Users should have the option to turn off patsy's nan checking if they don't want any at all.
Just for reference: in pandas you can now add np.nan as a level:
a = array([nan, 'CONFERENCE', 'ANALYST', 'FORUM', 'SEMINAR'], dtype=object)
df[cats] = pd.Categorical(a, levels=a) # works here because a has only unique values
Not sure what patsy makes of that and how it picks the reference level, though.