
formula support for categorical endog variable in logistic regression

Open stevenlis opened this issue 5 years ago • 1 comments

patsy: '0.5.1'

https://github.com/statsmodels/statsmodels/issues/5552

SM: 0.9.0

For a categorical endog variable in logistic regression, I still have to generate a dummy variable for it, like the following.

import pandas as pd
import seaborn as sns
import numpy as np
import statsmodels.formula.api as smf
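# load data (assumption: df is seaborn's "tips" dataset, which has 244 rows
# and the sex, smoker and time columns used below)
df = sns.load_dataset('tips')
# generate dummy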
df['male'] = df.sex.map({'Male': 1, 'Female': 0})
# regression
formula = 'male ~ C(smoker) + C(time)'
model = smf.logit(formula, data=df).fit()
model.summary()

If I just do

formula = 'C(sex) ~ C(smoker) + C(time)'
model = smf.logit(formula, data=df).fit()
model.summary()

I will get

ValueError: operands could not be broadcast together with shapes (244,2) (244,) 

This is a little bit weird, since the formula supports categorical variables everywhere except the endog. I wonder if this could be a potential feature to improve. Btw, is there any current workaround for this issue if I wanna use a formula?

@bashtage:

This is a patsy limit. You could just define a function C1

def C1(cat):
    return pd.get_dummies(cat, drop_first=True)

and then use

formula = 'C1(sex) ~ C(smoker) + C(time)'
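
The effect of `drop_first=True` can be checked with pandas alone. This is only a minimal sketch on a small made-up `sex` column (not the data from the issue). The `dtype=int` argument is an addition that is not in the reply above; it is there because recent pandas versions return boolean dummies by default.

```python
import pandas as pd

# Hypothetical toy data standing in for df.sex
sex = pd.Series(["Male", "Female", "Female", "Male"], name="sex")

# One column per level: a two-column result, like the (244, 2) in the error
full = pd.get_dummies(sex, dtype=int)

# Dropping the first level leaves a single 0/1 column
reduced = pd.get_dummies(sex, drop_first=True, dtype=int)

print(full.shape)             # (4, 2)
print(reduced.shape)          # (4, 1)
print(list(reduced.columns))  # ['Male']
```

For plain string data the levels are sorted, so `Female` is the level that gets dropped here. A pandas categorical column keeps its own category order, so the dropped level may differ.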

stevenlis (May 16 '19 15:05):

I also wish patsy could offer the ability to specify the coding. For example, if I have a variable with two categories, Yes and No, I may wanna code Yes as 0 and No as 1, or Yes as 1 and No as 0. I don't see any way to control this in pandas unless I review the coding returned by pandas and choose which one to use. I think it would be much easier if we could specify it just like we specify a baseline for a categorical variable in a patsy formula.
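
One possible way to make the coding explicit on the pandas side is to compare against the level that should be coded as 1, instead of relying on the column order of `get_dummies`. This is a rough sketch, not a patsy feature: `code_binary` is a hypothetical helper name and the Yes/No data is made up.

```python
import pandas as pd


def code_binary(values, one):
    """Return a 0/1 integer Series that is 1 where values == one.

    Hypothetical helper: it insists on exactly two levels and no missing
    values, so an unexpected level is not silently coded as 0.
    """
    values = pd.Series(values)
    if values.isna().any():
        raise ValueError("handle missing values before coding")
    levels = set(values.unique())
    if len(levels) != 2 or one not in levels:
        raise ValueError(f"expected two levels including {one!r}, got {list(levels)}")
    return (values == one).astype(int)


answer = pd.Series(["Yes", "No", "No", "Yes"])

yes_as_one = code_binary(answer, "Yes")  # Yes -> 1, No -> 0
no_as_one = code_binary(answer, "No")    # No -> 1, Yes -> 0

print(yes_as_one.tolist())  # [1, 0, 0, 1]
print(no_as_one.tolist())   # [0, 1, 1, 0]
```

The result can be assigned to a new column and used as the endog, the same way as the `male` column in the original post.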
