StatsModels.jl
Specify certain variables as categorical
The same variable can be treated as continuous or as categorical when fitting different regressions. One example is a date variable, which can be used either as a continuous variable (to adjust for a time trend) or as a categorical variable (to adjust for any time-specific effect).
Currently, this requires creating two versions of the variable in the dataframe, one continuous and the other categorical. To avoid this, it would be great to let users specify directly in the formula that certain variables should be treated as categorical.
It doesn't go in the formula directly, but you can specify it in the schema (directly or via the hints). I think the contrasts argument to fit/ModelFrame will also propagate to the hints. I'll check when I'm back at my computer.
On Mar 22, 2019, at 08:40, Matthieu Gomez wrote:
The same variable can be treated as either continuous or categorical when fitting regressions. One example is a date variable, which can be used either as a continuous variable (to adjust for a time trend) or as a categorical variable (to adjust for any time-specific effect). Currently, this requires creating two versions of the variable in the dataframe, one continuous and the other categorical.
It would be great to be able to specify that certain variables should be treated as categorical in the formula.
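For illustration, the schema-hint route mentioned above might look roughly like the sketch below. It assumes a StatsModels release with the Terms 2.0 API (0.6 or later), where `schema`, `apply_schema`, and `modelcols` are available; the data frame `d` and the names `hints`, `f_cont`, and `f_cat` are made up for this example.

```julia
using DataFrames, StatsModels

# Toy data (hypothetical): x1 is an integer column, so it is continuous by default.
d = DataFrame(y = 1:4, x1 = 5:8)
f = @formula(y ~ x1)

# Default schema: x1 is treated as a continuous predictor.
f_cont = apply_schema(f, schema(f, d), StatisticalModel)

# Same formula and same data, but with a hint that x1 should be categorical.
# No second, categorical copy of the column is added to d.
hints = Dict(:x1 => DummyCoding())
f_cat = apply_schema(f, schema(f, d, hints), StatisticalModel)

# Build the model matrices from the two resolved formulas.
_, X_cont = modelcols(f_cont, d)
_, X_cat = modelcols(f_cat, d)

size(X_cont), size(X_cat)
```

If the contrasts keyword does propagate to the hints as suggested above, passing the same `Dict` as `contrasts` to `ModelFrame` or `fit` should give the same result, but that is worth verifying against the installed version.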
We could special-case categorical(x) in the formula so that it's equivalent to actually creating a CategoricalArray, but without the intermediate allocation. Do you have any objection to the way categorical(x) currently works in formulas?
Special casing would be great. One tiny drawback is that categorical is a bit verbose for something that common (in Stata, one can simply write i.x), but that's a really minor point.
But can you confirm it already works as you would expect now with StatsModels master?
No, it does not (if I understand your question correctly). For now, using categorical in the formula fails because the package tries to apply the function elementwise.
using DataFrames, StatsModels
d = DataFrame()
d[:y] = [1:4;]
d[:x1] = [5:8;]
ModelMatrix(ModelFrame(@formula(y~categorical(x1)), d))
#> ERROR: MethodError: no method matching categorical(::Int64)
# but this works:
d[:cx1] = categorical(d[:x1])
ModelMatrix(ModelFrame(@formula(y~cx1), d))
#> ModelMatrix{Array{Float64,2}}([1.0 0.0 0.0 0.0; 1.0 1.0 0.0 0.0; 1.0 0.0 1.0 0.0; 1.0 0.0 0.0 1.0], [1, 1, 1, 1])
Ah, right, so that's the same issue as https://github.com/JuliaStats/StatsModels.jl/issues/94.
Since it doesn't look like we'll change the behavior in #94, we should just special-case categorical. Maybe we should also support a short syntax for it, like c. A way to choose the reference level using its index would also be useful (like Stata's bX.var).
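As a point of comparison for the reference-level idea, here is a minimal sketch of how the reference level can be chosen today by value (not by index) through a contrasts hint. It assumes StatsModels 0.6 or later and the `base` keyword of `DummyCoding`; the data and the names `hints` and `f_cat` are hypothetical.

```julia
using DataFrames, StatsModels

# Toy data (hypothetical): x1 takes the values 5, 6, 7, 8.
d = DataFrame(y = 1:4, x1 = 5:8)
f = @formula(y ~ x1)

# Treat x1 as categorical and use the level 7 as the reference level.
# Note that `base` is a level value, not the position of the level.
hints = Dict(:x1 => DummyCoding(base = 7))
f_cat = apply_schema(f, schema(f, d, hints), StatisticalModel)

# Coefficient names of the right-hand side: the intercept plus one
# dummy per non-reference level.
rhs_names = coefnames(f_cat.rhs)
```

This still needs a separate hints dictionary next to the formula, which is the verbosity that a short in-formula syntax would remove.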
I have been thinking about this for a bit. The way categorical variables are handled automatically seems unmathematical.
@formula(y ~ x + b)
Automatically creates a vector of dummies for each level in :b, right? That's a big change and it breaks the 1-1 relationship between mathematical notation and the regression. If it weren't so breaking, I'd vote for categorical(b) to be always required.
It's been the case even before Terms 2.0 (inspired by R I guess). I don't see the point of requiring categorical(b), since there's no other possible meaning for a non-numeric variable. AFAICT Stata's i.b requirement is just due to history, because they treat variables as numeric by default even when they have labels (Stata doesn't have the notion of factor).
The tradeoff is that you can't tell what @formula(y ~ x + b) means just by inspecting the formula.
But ultimately I understand the appeal. The R formula syntax has been helpful for its simplicity.