StatsModels.jl
StatsModels.jl copied to clipboard
Should coeffnames be Symbols?
It might be a bit breaking,
but Symbol
is the type generally used for The Name of a Thing
and DataFrame column names are Symbols
,
as are ColumnTables
Another thing to consider is whether we want the type to change depending on the inputs,
data = [(x = 0,)]
coefnames(apply_schema(@formula(0 ~ 0), schema(data)))
coefnames(apply_schema(@formula(0 ~ 1), schema(data)))
coefnames(apply_schema(@formula(0 ~ 1 + x), schema(data)))
coefnames(apply_schema(@formula(1 ~ 0), schema(data)))
coefnames(apply_schema(@formula(1 ~ 1), schema(data)))
coefnames(apply_schema(@formula(1 ~ 1 + x), schema(data)))
I really like the idea of using Symbol
for coefnames instead of strings but what would be used for interaction terms? Would we just want to currently developed string name and convert it to a symbol?
Sure, why not?
As noted by @kleinschmidt at #169, it's not clear what would be the advantage of doing this. Yet it's quite disruptive. Can you develop?
Just to add to what @oxinabox said in the OP, terms are also using symbols as names so it seems very odd that the only place we are using strings to refer to variables is the coefficients.
The actually reason I started looking at this was because it's easier to speed up operations on symbols than strings. I was looking at ways to query coefficients within models and formulas.
I think that if we were discussing this in the absence of other package dependencies the proposed solution would be a pretty obvious improvement. I consider statistical modeling a pretty important feature of Julia so I understand that disrupting the existing ecosystem isn't a trivial thing.
To repreat my original point: it puts us inline with how Tables.jl represents things.
We should probably also address what we'd miss out on if we no longer used strings
I see the point of consistency with Tables. OTOH, coefficients names are not exactly like terms, as you won't type them in a Julia-like syntax and they are often not valid Julia identifiers. This is clearly visible in the PR's tests: very often coefficient names cannot be typed nor printed directly using the :
syntax. Compare:
julia> ["Intercept", "x1p: 6", "x1p: 7", "x1p: 8"]
4-element Array{String,1}:
"Intercept"
"x1p: 6"
"x1p: 7"
"x1p: 8"
julia> [:Intercept, Symbol("x1p: 6"), Symbol("x1p: 7"), Symbol("x1p: 8")]
4-element Array{Symbol,1}:
:Intercept
Symbol("x1p: 6")
Symbol("x1p: 7")
Symbol("x1p: 8")
Can you explain in what context does the performance of symbols compared to strings really matters?
Aye. I would not be opposed to it if they mapped to the table, but something like
insure: Indemnity ~ (Intercept) 1.28694 0.59232 2.17271 0.0302 0.123689 2.4502
insure: Indemnity ~ age 0.00779612 0.0114418 0.681372 0.4959 -0.0146743 0.0302666
insure: Indemnity ~ male -0.451848 0.367486 -1.22957 0.2193 -1.17355 0.269855
insure: Indemnity ~ nonwhite -0.217059 0.425636 -0.509965 0.6103 -1.05296 0.618843
insure: Indemnity ~ site: 2 1.21152 0.470506 2.57493 0.0103 0.287497 2.13554
insure: Indemnity ~ site: 3 0.207813 0.366293 0.56734 0.5707 -0.511547 0.927172
insure: Prepaid ~ (Intercept) 1.55666 0.596327 2.61041 0.0093 0.385533 2.72778
insure: Prepaid ~ age -0.00394887 0.0115993 -0.340439 0.7336 -0.0267287 0.018831
insure: Prepaid ~ male 0.109846 0.365187 0.300793 0.7637 -0.607343 0.827035
insure: Prepaid ~ nonwhite 0.757718 0.419575 1.80592 0.0714 -0.0662835 1.58172
insure: Prepaid ~ site: 2 1.32456 0.469789 2.81947 0.0050 0.401941 2.24717
insure: Prepaid ~ site: 3 -0.380175 0.372819 -1.01973 0.3083 -1.11235 0.352001
has both the response and the predictor so it wouldn't map to a table. An even simpler case,
(Intercept) 8.23837 0.13135 62.721 <1e-99 7.98093 8.49581
Educatn 0.000538588 0.000138739 3.88203 0.0001 0.000266663 0.000810512
Age 0.0389003 0.0068679 5.66408 <1e-7 0.0254394 0.0523613
Age ^ 2 -0.000191192 8.86411e-5 -2.15692 0.0310 -0.000364926 -1.74575e-5
(Intercept)
is not a valid table column nor is Age ^ 2
.
Can you explain in what context does the performance of symbols compared to strings really matters?
My point with performance was that using a string to index vs symbol should have a similar performance hit as this:
julia> using BenchmarkTools
julia> x = [:a, :b, :c, :d, :e, :f, :g];
julia> y = string.(x);
julia> @btime findfirst(==(:e), x)
@ 82.534 ns (1 allocation: 16 bytes)
5
julia> @btime findfirst(==("e"), y)
111.964 ns (1 allocation: 16 bytes)
5
Well, if we're not using the traditional table interface then why do we need to use a String
or Symbol
for the rowname? Just use a dictionary with the terms.
My point with performance was that using a string to index vs symbol should have a similar performance hit as this:
But in practice, in what concrete cases does the 30ns overhead really matter?
Well, if we're not using the traditional table interface then why do we need to use a
String
orSymbol
for the rowname? Just use a dictionary with the terms.
Sorry, I don't get what you mean.
But in practice, in what concrete cases does the 30ns overhead really matter?
I imagine it would only matter if a particular algorithm required frequently looking up coefficients via a Symbol
. I don't think it' a common problem for something like a GLM but imagine it could become a source of overhead for something like a mixed effects model with many interacting random effects.
Sorry, I don't get what you mean.
This was in reference to comments about the Symbol("...")
syntax not being as idiomatic as using strings. The point of the comment was that if the only reason to keep using strings is because it's easier to reference coefficients with them and print their names in a table then we may as well just use a Term
. Those already print out the way we want them to and have a more direct relationship to the actual coefficient than a string does.
I understand these may seem like a lot of trivial points, but I still don't understand the benefit of using a string at all. Perhaps using a Symbol
isn't the right thing to do, but I can at least see some benefit to using it over the current method, but maybe I'm totally missing something.
But coefficient names aren't terms. There's a one-to-one relationship between coefficients and terms only for simple continuous terms, but not for categorical terms nor for splines, etc. We could introduce another type which would also store a reference to the term if that's useful, but coefnames
would still have to return something that's name-like (strings or symbols).
The performance advantage doesn't sound like a strong motivation to me, given that nobody seems to have a real use case for it. Compared to fitting a model, the cost is probably not that high.
The performance thing was more of an example that there is some benefit to using symbols. I apologize for taking too much time on that.
We could introduce another type which would also store a reference to the term if that's useful
This might actually be very useful. My original interest in this started with trying to get plots for statistic models (https://github.com/JuliaPlots/StatsPlots.jl/issues/290).
It is practically very common for me to make actual tables where the column names at the coefnames
so I can eyeball it and check it did what i wanted
before using it to fit a model
We don't provide a good interface for that yet.
We could provide that interface which would convert the coefnames
to symbols.
We've run into a very similar use case recently, wanting to generate table-like things from the outputs of simulations where there's one entry in a vector per coef, and want to make a table to collect the results. It's easy enough to do Symbol.(coefnames(f))
, but it could be nice to agree on a default. I don't know how breaking it would be to use symbols instead of strings from external perspective; internally we use the string concatenation syntax to (for instance) generate coefnames for categorical variables, but that's not a huge thing to change.
I guess that's a tradeoff depending on the use case. But in many instances strings will be more convenient than symbols since (as I noted) there's no direct way to enter symbols that are not valid identifiers using :
.
Maybe we could make it easier to create tables from names as string, automatically converting to symbols?
It seems like the greatest friction here is being caused by the extra syntax that using symbols would require? This is really only a problem for categorical and interacting terms, which the user never has to actually type in themselves unless searching for them using coefnames. Perhaps this seems could be solved using some special syntax with terms like term(:categories)[:g1]
to refer to group one of the categories variable.
Not really. Any FunctionTerm would have that issue (e.g., Age ^ 2
). I don’t see what would be gained exactly. What would the change help?
Sorry, for some reason I thought you could do term(:Age)^2
already. I was just trying to think of ways to take advantage of the more recent terms interface to solve the problem.