StatsModels.jl

Specify certain variables as categorical

Open matthieugomez opened this issue 6 years ago • 10 comments

The same variable can be considered as continuous or as categorical when fitting different regressions. One example is a date variable, which can alternatively be used as a continuous variable (to adjust for time trend) or as a categorical variable (to adjust for any time specific effect).

Currently, this requires creating two versions of the variable in the dataframe, one continuous and the other categorical. To avoid this, it would be great to let users specify directly in the formula that certain variables should be treated as categorical.

matthieugomez, Mar 22 '19 15:03

It doesn't go in the formula directly, but you can specify it in the schema (directly or via the hints). I think the contrasts argument to fit/ModelFrame will also propagate to the hints. I'll check when I'm back at my computer.
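
A minimal sketch of the contrasts route, assuming a recent StatsModels release in which `ModelFrame` accepts a `contrasts` keyword. The toy data is illustrative, and a `NamedTuple` of vectors stands in for the dataframe:

```julia
using StatsModels

# Toy data: x1 is stored as integers, so by default it is treated as continuous.
d = (y = [1.0, 2.0, 3.0, 4.0], x1 = [5, 6, 7, 8])

# Default: a single numeric column for x1 (plus the intercept).
X_cont = modelmatrix(ModelFrame(@formula(y ~ x1), d))

# A contrasts hint for x1 treats it as categorical without adding
# a second, categorical copy of the column to the table.
mf = ModelFrame(@formula(y ~ x1), d, contrasts = Dict(:x1 => DummyCoding()))
X_cat = modelmatrix(mf)

size(X_cont)  # (4, 2)
size(X_cat)   # (4, 4): intercept plus one dummy per non-reference level
```

The formula is the same in both calls; only the hint changes how `x1` is interpreted.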

kleinschmidt, Mar 22 '19 16:03

We could special-case categorical(x) in the formula so that it's equivalent to actually creating a CategoricalArray, but without the intermediate allocation. Do you have any objection to the way categorical(x) currently works in formulas?

nalimilan, Mar 22 '19 18:03

Special casing would be great. One tiny drawback is that categorical is a bit verbose for something that common (in Stata, one can simply write i.x), but that's a really minor point.

matthieugomez, Mar 22 '19 18:03

But can you confirm it already works as you would expect now with StatsModels master?

nalimilan, Mar 24 '19 13:03

No, it does not (if I understand your question correctly). For now, using categorical in the formula fails because the package tries to apply the function elementwise.

```julia
using DataFrames, StatsModels
d = DataFrame()
d[:y] = [1:4;]
d[:x1] = [5:8;]
ModelMatrix(ModelFrame(@formula(y~categorical(x1)), d))
#> ERROR: MethodError: no method matching categorical(::Int64)

# but this works:
d[:cx1] = categorical(d[:x1])
ModelMatrix(ModelFrame(@formula(y~cx1), d))
#> ModelMatrix{Array{Float64,2}}([1.0 0.0 0.0 0.0; 1.0 1.0 0.0 0.0; 1.0 0.0 1.0 0.0; 1.0 0.0 0.0 1.0], [1, 1, 1, 1])
```
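
To illustrate the elementwise behaviour behind this error, here is a small sketch, assuming a recent StatsModels release. A function call inside `@formula` is applied to each value of the column, which suits scalar functions such as `log` but not functions that expect the whole vector, such as `categorical`. The data is illustrative, and a `NamedTuple` of vectors stands in for the dataframe:

```julia
using StatsModels

d = (y = [1.0, 2.0, 3.0, 4.0], x1 = [5, 6, 7, 8])

# log is called once per element of x1, so this works.
X = modelmatrix(ModelFrame(@formula(y ~ log(x1)), d))

X[:, 2] ≈ log.(d.x1)  # the second column holds the elementwise logs
```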

matthieugomez, Mar 24 '19 14:03

Ah, right, so that's the same issue as https://github.com/JuliaStats/StatsModels.jl/issues/94.

nalimilan, Mar 24 '19 15:03

Since it doesn't look like we'll change the behavior in #94, we should just special-case categorical. Maybe we should also support a short syntax for it, like c. A way to choose the reference level using its index would also be useful (like Stata's bX.var).
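
On the reference level, a hedged sketch of what recent StatsModels releases offer: `DummyCoding` takes a `base` keyword that selects the reference level by value rather than by index, so it is not quite Stata's bX.var. The data below is illustrative:

```julia
using StatsModels

d = (y = [1.0, 2.0, 3.0, 4.0], x1 = [5, 6, 7, 8])

# Use the level 6 (rather than the first level, 5) as the reference category.
mf = ModelFrame(@formula(y ~ x1), d,
                contrasts = Dict(:x1 => DummyCoding(base = 6)))

coefnames(mf)    # intercept plus one dummy for each of the levels 5, 7 and 8
modelmatrix(mf)  # the row with x1 == 6 has zeros in every dummy column
```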

nalimilan, Apr 15 '19 12:04

I have been thinking about this for a bit. The way categorical variables are handled automatically seems unmathematical.

@formula(y ~ x + b)

This automatically creates a dummy column for each level of :b, right? That's a big change, and it breaks the one-to-one relationship between the mathematical notation and the regression. If it weren't so breaking, I'd vote for always requiring categorical(b).
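
A short sketch of that implicit expansion, assuming a recent StatsModels release and made-up data in which `b` holds strings:

```julia
using StatsModels

# b holds strings, so it is treated as categorical automatically.
d = (y = [1.0, 2.0, 3.0, 4.0],
     x = [0.5, 1.5, 2.5, 3.5],
     b = ["a", "b", "c", "a"])

mf = ModelFrame(@formula(y ~ x + b), d)

coefnames(mf)          # ["(Intercept)", "x", "b: b", "b: c"]
size(modelmatrix(mf))  # (4, 4), although the formula names only two variables
```

With dummy coding the first level ("a") serves as the reference, so `b` contributes one column per remaining level.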

pdeffebach, Apr 16 '19 04:04

That has been the case since before Terms 2.0 (inspired by R, I guess). I don't see the point of requiring categorical(b), since there's no other possible meaning for a non-numeric variable. AFAICT Stata's i.b requirement is just due to history: it treats variables as numeric by default even when they have labels (Stata has no notion of factors).

nalimilan, Apr 16 '19 11:04

The tradeoff is that you can't tell what @formula(y ~ x + b) means just from inspecting the formula.

But ultimately I understand the appeal. The R formula syntax has been helpful for its simplicity.

pdeffebach, Apr 16 '19 14:04