docs need to explain "aliases" better
https://juliastats.org/StatsModels.jl/latest/contrasts/#Further-details says
> A categorical variable in a term aliases the term that remains when that variable is dropped.
I can't figure out what this means, even after looking at the examples following that quote. Are there some words missing?
For example:
1. In `~a+b+c`, referring to "the term that remains after dropping `a`" is meaningless, since there are 2.
2. So does `a` alias both `b` and `c` in that case?
3. Which seems to mean everything aliases everything, a not particularly helpful concept.
4. The first example says the sole variable `a` aliases the intercept `1`. But if there is no explicit `1`, then what? (A bit further down it says if `y~0+a` then `a` aliases nothing. But what if the intercept is implicit?)
5. "Linear dependence" or "not full rank" at least mean something to me, and seem in the same ballpark. But the discussion clearly intends aliasing to occur even absent linear dependence.
6. My immediate association with "alias" in the context of computers is 2 variables referring to overlapping memory. That's clearly not the intended meaning here.
7. In `~a&b + a&b&c`, the first expression `a&b` is completely redundant. It is unclear how that's handled by StatsModels or how it relates to the discussion of aliases.
8. Handling of `~a&b + a&c` is also unclear.
9. But since I couldn't even understand a simple main effects model in 1., it's unsurprising I don't understand interactions.
Sorry to hear the explanation of this behavior isn't clear! If there are specific situations that you've run into that lead to diving deeply into this behavior, that might help with improving the docs. If it's just curiosity, then that's something I understand well :) We're always open to suggestions for how to improve the docs. From your questions I think there are a few points of clarification that would help:
- When we're talking about "dropping terms", that refers to one of two cases: removing one variable from an interaction term, or "removing" the one variable from a main effect term (leaving the intercept). So to your 1.-3., "dropping `a` from `a`" gives `1`; so each of `a`, `b`, and `c` aliases the (implicit) intercept.
- Implicit intercepts work exactly like an explicit `1`.
- The term "alias" is meant to capture the hypothetical situation that IF a variable in a term were represented with "full rank" contrasts, THEN, because of the other variables present in the same term (e.g., the rest of the variables in the interaction), you would end up with a model matrix that is not full rank. Obviously that's a bit hefty as language to refer to something that is mentioned many times in this docs section, and "alias" is the best we could come up with! Maybe something like "bridge" would make that clearer; as a bit of linguistic arcana, the problem we ran into is that most of the verbs that capture this kind of connection have multiple thematic role structures (so the subject of 'alias' can be the thing that's being connected, or the thing that's connecting something else...).
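To see the hypothetical rank collapse that definition is pointing at, here is a small numerical sketch (in Python/numpy for illustration only, not StatsModels code; the data are made up): a 3-level categorical alongside an intercept is rank deficient under full dummy coding, but full rank once one level's column is dropped.

```python
import numpy as np

# Categorical `a` with 3 levels, observed as [0, 1, 2, 0, 1, 2].
levels = np.array([0, 1, 2, 0, 1, 2])
intercept = np.ones((6, 1))

# Full-rank ("full dummy") coding: one indicator column per level.
full = (levels[:, None] == np.arange(3)).astype(float)

# Reduced-rank coding: drop the first level's indicator column.
reduced = full[:, 1:]

# With the intercept present, full dummy coding is rank deficient,
# because the three indicator columns sum to the intercept column.
X_full = np.hstack([intercept, full])        # 4 columns
X_reduced = np.hstack([intercept, reduced])  # 3 columns

print(np.linalg.matrix_rank(X_full))     # 3, not 4 -> rank deficient
print(np.linalg.matrix_rank(X_reduced))  # 3 -> full column rank
```

This is why `a` is said to alias the intercept in `~1+a`: keeping both at full rank would produce exactly this kind of collinear model matrix.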
- Generally, the algorithm goes left to right, checking what terms have been found so far and promoting a categorical variable in the current term to full rank if dropping it leaves a term that hasn't been found so far; when such a promotion happens, the aliased term is added to the internal list of terms encountered so far.
- For the specific case of `~a&b + a&c` (assuming all three of `a`, `b`, and `c` are categorical), we first add the implicit intercept. Then we check `a&b`, starting with `a`. Since `b` hasn't been seen yet, `a` is promoted to full rank and a main effect `b` is added to the list of things we've seen. Now we look at `b` in `a&b`; dropping that gives `a`, which also has not been seen on its own, so we promote `b` to full rank and add `a` to the list of things we've seen. Next up is `a&c`; drop `a`, which gives a main effect of `c`, which hasn't been seen yet either, so `a` gets promoted here, too, and a main effect of `c` gets added to the list. Lastly, we drop `c` from `a&c`; that gives a main effect of `a`, which we HAVE already seen (from promoting `b` to full rank), so `c` stays as it is. So what we end up with is `f(a)&f(b) + f(a)&c` (using `f()` as a shorthand for a term with `FullRankDummyCoding`).
- There are, of course, extensive tests for this behavior: https://github.com/JuliaStats/StatsModels.jl/blob/master/test/modelmatrix.jl#L157-L301
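The walkthrough above can be condensed into a toy re-implementation (in Python, for illustration only; the real StatsModels code tracks more structure and only promotes *categorical* variables, whereas this sketch treats every variable as categorical). Terms are sets of variable names, and `frozenset()` stands for the intercept:

```python
def promote(terms, implicit_intercept=True):
    """Left-to-right promotion sketch: for each variable in each term,
    drop it; if the remaining ("aliased") term hasn't been seen yet,
    promote the variable to full rank and record the aliased term."""
    seen = {frozenset()} if implicit_intercept else set()
    result = []
    for term in terms:
        coding = {}
        for var in sorted(term):  # term order; sorted() matches a, b, c here
            remaining = frozenset(term - {var})
            if remaining not in seen:
                coding[var] = "full"      # promote: aliased term is new
                seen.add(remaining)
            else:
                coding[var] = "reduced"   # aliased term already covered
        seen.add(frozenset(term))
        result.append(coding)
    return result

# ~ a&b + a&c with a, b, c all categorical:
print(promote([{"a", "b"}, {"a", "c"}]))
# [{'a': 'full', 'b': 'full'}, {'a': 'full', 'c': 'reduced'}]
```

This reproduces the `f(a)&f(b) + f(a)&c` result above, and also the earlier intercept cases: with an implicit intercept, a lone `a` stays reduced rank, while under `~0+a` it gets promoted.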
To be honest, it took me quite a bit of head scratching, testing things in R, and reading the MASS book to figure out what R was doing clearly enough to implement the similar functionality here. We went back and forth quite a bit on how to explain things, and it's hard to strike a balance: being concise and clear without being overly technical, while still providing enough information to make sense of behavior that can sometimes be counterintuitive. I'm afraid that there's no amount of explanation that will make this stuff super clear without a similar amount of head scratching.
Thank you for your detailed response. I haven't fully processed it, but I can tell you some more about why I am digging into this.
I want to implement an extension of the formula syntax that would allow one to request all interactions up to a certain order from a list of variables, something like `nway(a b c d e f, 3)` to generate all terms up to 3-way interactions. Another way of thinking of it is that it's `a*b*c*d*e*f` with the higher-order interactions dropped. There's a lot of potential redundancy there, and I want to understand how that's handled. The variables in the list may not all be categorical.
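The term-generation part of that hypothetical `nway` helper is just "all combinations up to order k". A sketch (in Python; the name `nway` and the tuple representation are made up for illustration — a real implementation would emit StatsModels interaction terms):

```python
from itertools import chain, combinations

def nway(variables, k):
    """All main effects and interactions of `variables` up to order k,
    each term represented as a tuple of variable names."""
    return [tuple(c) for c in chain.from_iterable(
        combinations(variables, order) for order in range(1, k + 1))]

print(nway(["a", "b", "c", "d"], 2))
# [('a',), ('b',), ('c',), ('d',), ('a', 'b'), ('a', 'c'),
#  ('a', 'd'), ('b', 'c'), ('b', 'd'), ('c', 'd')]
```

For six variables and k=3 this yields 6 + 15 + 20 = 41 terms, which is where the redundancy-handling question becomes pressing.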
Ah yeah, that makes sense. StatsModels SHOULD just work if you emit the necessary interaction terms before calling `apply_schema`. Also, I think in RegressionFormulae.jl we implement the R-style `^` syntax for doing that. Even if you still need to implement it for your own purposes, that should help provide some guidance!
Thanks for the pointer; RegressionFormulae.jl looks like just what I need. At least until things get more complicated.
I'm still not following all the details, but perhaps the docs could just note that if expanding a categorical variable to full rank in a particular term would create a linear dependency with columns to the left, StatsModels instead uses the reduced rank expansion.
For comparison, `y~a+b+a` gets turned into `y~a+b`, and this behavior, as far as I can tell, is completely undocumented.
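The observable effect of that elimination amounts to an ordered de-duplication of terms. A one-line sketch (in Python, describing the behavior I observe, not StatsModels' actual implementation):

```python
def dedup(terms):
    # dict.fromkeys preserves insertion order, keeping the first
    # occurrence of each term and dropping later repeats
    return list(dict.fromkeys(terms))

print(dedup(["a", "b", "a"]))  # ['a', 'b']
```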
With any elimination behavior, there are 3 approaches to documentation:
- Don't mention it.
- Describe the goal of the behavior; that's what I did above.
- Describe how the goal is achieved. The current docs are sort of trying to do that.