docs need to explain "aliases" better
https://juliastats.org/StatsModels.jl/latest/contrasts/#Further-details says
> A categorical variable in a term aliases the term that remains when that variable is dropped.
I can't figure out what this means, even after looking at the examples following that quote. Are there some words missing?
For example:
1. In `~a+b+c`, referring to "the term that remains after dropping `a`" is meaningless, since there are 2.
2. So does `a` alias both `b` and `c` in that case?
3. Which seems to mean everything aliases everything, a not particularly helpful concept.
4. The first example says the sole variable `a` aliases the intercept `1`. But if there is no explicit `1`, then what? (A bit further down it says if `y~0+a` then `a` aliases nothing. But what if the intercept is implicit?)
5. "Linear dependence" or "not full rank" at least mean something to me, and seem in the same ballpark. But the discussion clearly intends aliasing to occur even absent linear dependence.
6. My immediate association with "alias" in the context of computers is 2 variables referring to overlapping memory. That's clearly not the intended meaning here.
7. In `~a&b + a&b&c`, the first expression `a&b` is completely redundant. It is unclear how that's handled by StatsModels or how it relates to the discussion of aliases.
8. Handling of `~a&b + a&c` is also unclear.
9. But since I couldn't even understand a simple main effects model in 1., it's unsurprising I don't understand interactions.
Sorry to hear the explanation of this behavior isn't clear! If there are specific situations that you've run into that lead to diving deeply into this behavior, that might help with improving the docs. If it's just curiosity, then that's something I understand well :) We're always open to suggestions for how to improve the docs. From your questions I think there are a few points of clarification that would help:
- When we're talking about "dropping terms", that refers to one of two cases: removing one variable from an interaction term, or "removing" the one variable from a main effect term (leaving the intercept). So to your 1.-3., "dropping `a` from `a`" gives `1`; so each of `a`, `b`, and `c` aliases the (implicit) intercept.
- Implicit intercepts work exactly like an explicit `1`.
- The term "alias" is meant to capture the hypothetical situation that IF a variable in a term were represented with "full rank" contrasts, THEN, because of the other variables present in the same term (e.g., the rest of the variables in the interaction), you would end up with a model matrix that is not full rank. Obviously that's a bit hefty as language to refer to something that is mentioned many times in this docs section, and "alias" is the best we could come up with! Maybe something like "bridge" would make that clearer; as a bit of linguistic arcana, the problem we ran into is that most of the verbs that capture this kind of connection have multiple thematic role structures (so the subject of 'alias' can be the thing that's being connected, or the thing that's connecting something else...).
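To see the hypothetical rank collapse that definition is pointing at, here is a small numerical sketch (in Python/numpy for illustration only, not StatsModels code; the data are made up): a 3-level categorical alongside an intercept is rank deficient under full dummy coding, but full rank once one level's column is dropped.

```python
import numpy as np

# Categorical `a` with 3 levels, observed as [0, 1, 2, 0, 1, 2].
levels = np.array([0, 1, 2, 0, 1, 2])
intercept = np.ones((6, 1))

# Full-rank ("full dummy") coding: one indicator column per level.
full = (levels[:, None] == np.arange(3)).astype(float)

# Reduced-rank coding: drop the first level's indicator column.
reduced = full[:, 1:]

# With the intercept present, full dummy coding is rank deficient,
# because the three indicator columns sum to the intercept column.
X_full = np.hstack([intercept, full])        # 4 columns
X_reduced = np.hstack([intercept, reduced])  # 3 columns

print(np.linalg.matrix_rank(X_full))     # 3, not 4 -> rank deficient
print(np.linalg.matrix_rank(X_reduced))  # 3 -> full column rank
```

This is why `a` is said to alias the intercept in `~1+a`: keeping both at full rank would produce exactly this kind of collinear model matrix.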
- Generally, the algorithm goes left to right, checking what terms have been found so far and promoting a categorical variable in the current term to full rank if dropping it leaves a term that hasn't been found so far; when such a promotion happens, the aliased term is added to the internal list of terms encountered so far.
- For the specific case of `~a&b + a&c` (assuming all three of `a`, `b`, and `c` are categorical), we first add the implicit intercept. Then we check `a&b`, starting with `a`. Since `b` hasn't been seen yet, `a` is promoted to full rank and a main effect `b` is added to the list of things we've seen. Now we look at `b` in `a&b`; dropping that gives `a`, which also has not been seen on its own, so we promote `b` to full rank and add `a` to the list of things we've seen. Next up is `a&c`; drop `a`, which gives a main effect of `c`, which hasn't been seen yet either, so `a` gets promoted here, too, and a main effect of `c` gets added to the list. Lastly, we drop `c` from `a&c`; that gives a main effect of `a`, which we HAVE already seen (from promoting `b` to full rank), so `c` stays as it is. So what we end up with is `f(a)&f(b) + f(a)&c` (using `f()` as a shorthand for a term with `FullRankDummyCoding`).
- There are, of course, extensive tests for this behavior: https://github.com/JuliaStats/StatsModels.jl/blob/master/test/modelmatrix.jl#L157-L301
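The walkthrough above can be condensed into a toy re-implementation (in Python, for illustration only; the real StatsModels code tracks more structure and only promotes *categorical* variables, whereas this sketch treats every variable as categorical). Terms are sets of variable names, and `frozenset()` stands for the intercept:

```python
def promote(terms, implicit_intercept=True):
    """Left-to-right promotion sketch: for each variable in each term,
    drop it; if the remaining ("aliased") term hasn't been seen yet,
    promote the variable to full rank and record the aliased term."""
    seen = {frozenset()} if implicit_intercept else set()
    result = []
    for term in terms:
        coding = {}
        for var in sorted(term):  # term order; sorted() matches a, b, c here
            remaining = frozenset(term - {var})
            if remaining not in seen:
                coding[var] = "full"      # promote: aliased term is new
                seen.add(remaining)
            else:
                coding[var] = "reduced"   # aliased term already covered
        seen.add(frozenset(term))
        result.append(coding)
    return result

# ~ a&b + a&c with a, b, c all categorical:
print(promote([{"a", "b"}, {"a", "c"}]))
# [{'a': 'full', 'b': 'full'}, {'a': 'full', 'c': 'reduced'}]
```

This reproduces the `f(a)&f(b) + f(a)&c` result above, and also the earlier intercept cases: with an implicit intercept, a lone `a` stays reduced rank, while under `~0+a` it gets promoted.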
To be honest, it took me quite a bit of head scratching, testing things in R, and reading the MASS book to figure out what R was doing clearly enough to implement the similar functionality here. We went back and forth quite a bit on how to explain things, and it's hard to strike a balance: being concise and clear without being overly technical, while still providing enough information to make sense of behavior that can sometimes be counterintuitive. I'm afraid that there's no amount of explanation that will make this stuff super clear without a similar amount of head scratching.
Thank you for your detailed response. I haven't fully processed it, but I can tell you some more about why I am digging into this.
I want to implement an extension of the formula syntax that would allow one to request all interactions up to a certain order from a list of variables, something like `nway(a b c d e f, 3)` to generate all terms up to 3-way interactions. Another way of thinking of it is that it's `a*b*c*d*e*f` with the higher-order interactions dropped. There's a lot of potential redundancy there, and I want to understand how that's handled. The variables in the list may not all be categorical.
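The term-generation part of that hypothetical `nway` helper is just "all combinations up to order k". A sketch (in Python; the name `nway` and the tuple representation are made up for illustration — a real implementation would emit StatsModels interaction terms):

```python
from itertools import chain, combinations

def nway(variables, k):
    """All main effects and interactions of `variables` up to order k,
    each term represented as a tuple of variable names."""
    return [tuple(c) for c in chain.from_iterable(
        combinations(variables, order) for order in range(1, k + 1))]

print(nway(["a", "b", "c", "d"], 2))
# [('a',), ('b',), ('c',), ('d',), ('a', 'b'), ('a', 'c'),
#  ('a', 'd'), ('b', 'c'), ('b', 'd'), ('c', 'd')]
```

For six variables and k=3 this yields 6 + 15 + 20 = 41 terms, which is where the redundancy-handling question becomes pressing.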
Ah yeah, that makes sense. StatsModels SHOULD just work if you emit the necessary interaction terms before calling `apply_schema`. Also, I think in RegressionFormulae.jl we implement the R-style `^` syntax for doing that. Even if you still need to implement it for your own purposes, that should help provide some guidance!
Thanks for the pointer; RegressionFormulae.jl looks like just what I need. At least until things get more complicated.
I'm still not following all the details, but perhaps the docs could just note that if expanding a categorical variable to full rank in a particular term would create a linear dependency with columns to the left, StatsModels instead uses the reduced rank expansion.
For comparison, `y~a+b+a` gets turned into `y~a+b`, and this behavior, as far as I can tell, is completely undocumented.
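The observable effect of that elimination amounts to an ordered de-duplication of terms. A one-line sketch (in Python, describing the behavior I observe, not StatsModels' actual implementation):

```python
def dedup(terms):
    # dict.fromkeys preserves insertion order, keeping the first
    # occurrence of each term and dropping later repeats
    return list(dict.fromkeys(terms))

print(dedup(["a", "b", "a"]))  # ['a', 'b']
```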
With any elimination behavior, there are 3 approaches to documentation:
- Don't mention it.
- Describe the goal of the behavior; that's what I did above.
- Describe how the goal is achieved. The current docs are sort of trying to do that.