patsy
formula support for categorical endog variable in logistic regression
patsy: '0.5.1'
https://github.com/statsmodels/statsmodels/issues/5552
SM: 0.9.0
For a categorical endog variable in logistic regression, I still have to generate a dummy variable for it, like the following:

```python
import pandas as pd
import seaborn as sns
import numpy as np
import statsmodels.formula.api as smf

# generate dummy
df['male'] = df.sex.map({'Male': 1, 'Female': 0})

# regression
formula = 'male ~ C(smoker) + C(time)'
model = smf.logit(formula, data=df).fit()
model.summary()
```
If I just do
```python
formula = 'C(sex) ~ C(smoker) + C(time)'
model = smf.logit(formula, data=df).fit()
model.summary()
```
I will get
```
ValueError: operands could not be broadcast together with shapes (244,2) (244,)
```
This is a little weird, since the formula supports categorical variables everywhere except the endog. I wonder if this could be a potential feature to improve. By the way, is there any current workaround for this issue if I want to use a formula?
@bashtage:
This is a patsy limitation. You could just define a function C1:
```python
def C1(cat):
    return pd.get_dummies(cat, drop_first=True)
```
and then use
```python
formula = 'C1(sex) ~ C(smoker) + C(time)'
```
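A minimal end-to-end sketch of this workaround. The original `df` isn't shown, so the snippet builds a synthetic stand-in (an assumption: the real data has string-valued `sex`, `smoker`, and `time` columns, like seaborn's tips dataset); the `.astype(float)` cast is added so patsy treats the dummy as numeric rather than boolean:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the tips-style data (assumption, n matches the
# shape in the traceback above)
rng = np.random.default_rng(0)
n = 244
df = pd.DataFrame({
    'sex': rng.choice(['Male', 'Female'], size=n),
    'smoker': rng.choice(['Yes', 'No'], size=n),
    'time': rng.choice(['Lunch', 'Dinner'], size=n),
})

def C1(cat):
    # drop_first=True keeps a single column for a binary category;
    # cast to float so the endog is numeric 0/1, not boolean
    return pd.get_dummies(cat, drop_first=True).astype(float)

# patsy resolves C1 from the calling namespace, so it can appear in the formula
model = smf.logit('C1(sex) ~ C(smoker) + C(time)', data=df).fit(disp=0)
print(model.params)
```

Patsy evaluates `C1(sex)` in the caller's namespace, so any plain Python function works on the left-hand side as long as it returns a numeric column.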
I also wish patsy could offer the ability to specify the coding. For example, if I have a variable with two categories, Yes and No, I may want to code Yes as 0 and No as 1, or Yes as 1 and No as 0. I don't see any way to control this in pandas unless I review the coding returned by pandas and choose which one to use. It would be much easier if we could specify it directly, just like how we specify a baseline for a categorical variable in a patsy formula.
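One way to get that control in pandas today is to fix the category order before calling `get_dummies`, so that `drop_first` drops the level you want coded as 0. A small sketch (the example series is made up):

```python
import pandas as pd

s = pd.Series(['Yes', 'No', 'Yes', 'No'])

# Code Yes=1 / No=0: put 'No' first so drop_first drops it
yes1 = pd.get_dummies(pd.Categorical(s, categories=['No', 'Yes']),
                      drop_first=True).astype(int)

# Flip the coding (No=1 / Yes=0) by swapping the category order
no1 = pd.get_dummies(pd.Categorical(s, categories=['Yes', 'No']),
                     drop_first=True).astype(int)

print(yes1['Yes'].tolist())  # [1, 0, 1, 0]
print(no1['No'].tolist())    # [0, 1, 0, 1]
```

On the exog side, patsy already supports choosing the reference level, e.g. `C(smoker, Treatment(reference='Yes'))`; the wish here is for the same kind of control on the endog side.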