StatsModels.jl icon indicating copy to clipboard operation
StatsModels.jl copied to clipboard

Should coeffnames be Symbols?

Open oxinabox opened this issue 5 years ago • 21 comments

It might be a bit breaking, but Symbol is the type generally used for The Name of a Thing

and DataFrame column names are Symbols, as are ColumnTables

oxinabox avatar Jun 13 '19 13:06 oxinabox

Another thing to consider is whether we want the type to change depending on the inputs,

data = [(x = 0,)]
coefnames(apply_schema(@formula(0 ~ 0), schema(data)))
coefnames(apply_schema(@formula(0 ~ 1), schema(data)))
coefnames(apply_schema(@formula(0 ~ 1 + x), schema(data)))
coefnames(apply_schema(@formula(1 ~ 0), schema(data)))
coefnames(apply_schema(@formula(1 ~ 1), schema(data)))
coefnames(apply_schema(@formula(1 ~ 1 + x), schema(data)))

Nosferican avatar Jun 27 '19 02:06 Nosferican

I really like the idea of using Symbol for coefnames instead of strings but what would be used for interaction terms? Would we just want to currently developed string name and convert it to a symbol?

Tokazama avatar Jan 16 '20 14:01 Tokazama

Sure, why not?

oxinabox avatar Jan 16 '20 18:01 oxinabox

As noted by @kleinschmidt at #169, it's not clear what would be the advantage of doing this. Yet it's quite disruptive. Can you develop?

nalimilan avatar Feb 13 '20 21:02 nalimilan

Just to add to what @oxinabox said in the OP, terms are also using symbols as names so it seems very odd that the only place we are using strings to refer to variables is the coefficients.

The actually reason I started looking at this was because it's easier to speed up operations on symbols than strings. I was looking at ways to query coefficients within models and formulas.

I think that if we were discussing this in the absence of other package dependencies the proposed solution would be a pretty obvious improvement. I consider statistical modeling a pretty important feature of Julia so I understand that disrupting the existing ecosystem isn't a trivial thing.

Tokazama avatar Feb 13 '20 21:02 Tokazama

To repreat my original point: it puts us inline with how Tables.jl represents things.

oxinabox avatar Feb 14 '20 20:02 oxinabox

We should probably also address what we'd miss out on if we no longer used strings

Tokazama avatar Feb 15 '20 19:02 Tokazama

I see the point of consistency with Tables. OTOH, coefficients names are not exactly like terms, as you won't type them in a Julia-like syntax and they are often not valid Julia identifiers. This is clearly visible in the PR's tests: very often coefficient names cannot be typed nor printed directly using the : syntax. Compare:

julia> ["Intercept", "x1p: 6", "x1p: 7", "x1p: 8"]
4-element Array{String,1}:
 "Intercept"
 "x1p: 6"
 "x1p: 7"
 "x1p: 8"

julia> [:Intercept, Symbol("x1p: 6"), Symbol("x1p: 7"), Symbol("x1p: 8")]
4-element Array{Symbol,1}:
 :Intercept
 Symbol("x1p: 6")
 Symbol("x1p: 7")
 Symbol("x1p: 8")

Can you explain in what context does the performance of symbols compared to strings really matters?

nalimilan avatar Feb 16 '20 17:02 nalimilan

Aye. I would not be opposed to it if they mapped to the table, but something like

insure: Indemnity ~ (Intercept)   1.28694     0.59232     2.17271     0.0302   0.123689   2.4502   
insure: Indemnity ~ age           0.00779612  0.0114418   0.681372    0.4959  -0.0146743  0.0302666
insure: Indemnity ~ male         -0.451848    0.367486   -1.22957     0.2193  -1.17355    0.269855 
insure: Indemnity ~ nonwhite     -0.217059    0.425636   -0.509965    0.6103  -1.05296    0.618843 
insure: Indemnity ~ site: 2       1.21152     0.470506    2.57493     0.0103   0.287497   2.13554  
insure: Indemnity ~ site: 3       0.207813    0.366293    0.56734     0.5707  -0.511547   0.927172 
insure: Prepaid ~ (Intercept)     1.55666     0.596327    2.61041     0.0093   0.385533   2.72778  
insure: Prepaid ~ age            -0.00394887  0.0115993  -0.340439    0.7336  -0.0267287  0.018831 
insure: Prepaid ~ male            0.109846    0.365187    0.300793    0.7637  -0.607343   0.827035 
insure: Prepaid ~ nonwhite        0.757718    0.419575    1.80592     0.0714  -0.0662835  1.58172  
insure: Prepaid ~ site: 2         1.32456     0.469789    2.81947     0.0050   0.401941   2.24717  
insure: Prepaid ~ site: 3        -0.380175    0.372819   -1.01973     0.3083  -1.11235    0.352001 

has both the response and the predictor so it wouldn't map to a table. An even simpler case,

(Intercept)   8.23837      0.13135      62.721      <1e-99   7.98093       8.49581    
Educatn       0.000538588  0.000138739   3.88203    0.0001   0.000266663   0.000810512
Age           0.0389003    0.0068679     5.66408    <1e-7    0.0254394     0.0523613  
Age ^ 2      -0.000191192  8.86411e-5   -2.15692    0.0310  -0.000364926  -1.74575e-5 

(Intercept) is not a valid table column nor is Age ^ 2.

Nosferican avatar Feb 16 '20 17:02 Nosferican

Can you explain in what context does the performance of symbols compared to strings really matters?

My point with performance was that using a string to index vs symbol should have a similar performance hit as this:

julia> using BenchmarkTools

julia> x = [:a, :b, :c, :d, :e, :f, :g];

julia> y = string.(x);

julia> @btime findfirst(==(:e), x)
@  82.534 ns (1 allocation: 16 bytes)
5

julia> @btime findfirst(==("e"), y)
  111.964 ns (1 allocation: 16 bytes)
5

Tokazama avatar Feb 16 '20 17:02 Tokazama

Well, if we're not using the traditional table interface then why do we need to use a String or Symbol for the rowname? Just use a dictionary with the terms.

Tokazama avatar Feb 16 '20 17:02 Tokazama

My point with performance was that using a string to index vs symbol should have a similar performance hit as this:

But in practice, in what concrete cases does the 30ns overhead really matter?

Well, if we're not using the traditional table interface then why do we need to use a String or Symbol for the rowname? Just use a dictionary with the terms.

Sorry, I don't get what you mean.

nalimilan avatar Feb 17 '20 12:02 nalimilan

But in practice, in what concrete cases does the 30ns overhead really matter?

I imagine it would only matter if a particular algorithm required frequently looking up coefficients via a Symbol. I don't think it' a common problem for something like a GLM but imagine it could become a source of overhead for something like a mixed effects model with many interacting random effects.

Sorry, I don't get what you mean.

This was in reference to comments about the Symbol("...") syntax not being as idiomatic as using strings. The point of the comment was that if the only reason to keep using strings is because it's easier to reference coefficients with them and print their names in a table then we may as well just use a Term. Those already print out the way we want them to and have a more direct relationship to the actual coefficient than a string does.

I understand these may seem like a lot of trivial points, but I still don't understand the benefit of using a string at all. Perhaps using a Symbol isn't the right thing to do, but I can at least see some benefit to using it over the current method, but maybe I'm totally missing something.

Tokazama avatar Feb 17 '20 13:02 Tokazama

But coefficient names aren't terms. There's a one-to-one relationship between coefficients and terms only for simple continuous terms, but not for categorical terms nor for splines, etc. We could introduce another type which would also store a reference to the term if that's useful, but coefnames would still have to return something that's name-like (strings or symbols).

The performance advantage doesn't sound like a strong motivation to me, given that nobody seems to have a real use case for it. Compared to fitting a model, the cost is probably not that high.

nalimilan avatar Feb 17 '20 13:02 nalimilan

The performance thing was more of an example that there is some benefit to using symbols. I apologize for taking too much time on that.

We could introduce another type which would also store a reference to the term if that's useful

This might actually be very useful. My original interest in this started with trying to get plots for statistic models (https://github.com/JuliaPlots/StatsPlots.jl/issues/290).

Tokazama avatar Feb 17 '20 13:02 Tokazama

It is practically very common for me to make actual tables where the column names at the coefnames so I can eyeball it and check it did what i wanted before using it to fit a model

We don't provide a good interface for that yet. We could provide that interface which would convert the coefnames to symbols.

oxinabox avatar Feb 17 '20 13:02 oxinabox

We've run into a very similar use case recently, wanting to generate table-like things from the outputs of simulations where there's one entry in a vector per coef, and want to make a table to collect the results. It's easy enough to do Symbol.(coefnames(f)), but it could be nice to agree on a default. I don't know how breaking it would be to use symbols instead of strings from external perspective; internally we use the string concatenation syntax to (for instance) generate coefnames for categorical variables, but that's not a huge thing to change.

kleinschmidt avatar Feb 17 '20 14:02 kleinschmidt

I guess that's a tradeoff depending on the use case. But in many instances strings will be more convenient than symbols since (as I noted) there's no direct way to enter symbols that are not valid identifiers using :.

Maybe we could make it easier to create tables from names as string, automatically converting to symbols?

nalimilan avatar Feb 17 '20 16:02 nalimilan

It seems like the greatest friction here is being caused by the extra syntax that using symbols would require? This is really only a problem for categorical and interacting terms, which the user never has to actually type in themselves unless searching for them using coefnames. Perhaps this seems could be solved using some special syntax with terms like term(:categories)[:g1] to refer to group one of the categories variable.

Tokazama avatar Feb 17 '20 16:02 Tokazama

Not really. Any FunctionTerm would have that issue (e.g., Age ^ 2). I don’t see what would be gained exactly. What would the change help?

Nosferican avatar Feb 17 '20 16:02 Nosferican

Sorry, for some reason I thought you could do term(:Age)^2 already. I was just trying to think of ways to take advantage of the more recent terms interface to solve the problem.

Tokazama avatar Feb 18 '20 23:02 Tokazama