StatsModels.jl RFC: add an additional apply_context stage

There are some situations where you want to transform terms before the schema is available, in part because this may affect how the schema itself is computed. For instance, for MixedModels.jl, we often have categorical "grouping variables" with a very large number of levels, for which the contrasts matrix is expensive to compute and store but which is never actually used. These are specified as the second argument in expressions like (1 + x + y | group), which gets initially parsed as FunctionTerm{typeof(|)}. It would, of course, be possible for MixedModels to add an additional method to concrete_term that says to treat the :group variable as categorical but not to compute the contrasts matrix (e.g., by using some other hypothetical GroupingTerm struct), but that's type piracy because another package might want to use calls to | in a different way.

So the proposal is to add a stage that's similar to apply_schema(term, schema, context) but which doesn't have the schema, something like apply_context(term, context), which runs before schema. Then in the case of MixedModels.jl, you could define apply_context(::FunctionTerm{typeof(|)}, ::Type{<:MixedModel}) = RanefTerm(...), and then dispatch on RanefTerm for schema/concrete_term.
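A rough sketch of how that dispatch could look, using toy stand-in types rather than the real StatsModels ones. The field layout of FunctionTerm and RanefTerm and the abstract MixedModel type below are assumptions made only for illustration:

```julia
# Toy stand-ins; none of these definitions are the StatsModels or MixedModels API.
abstract type AbstractTerm end

struct Term <: AbstractTerm
    sym::Symbol
end

# A parsed but not yet interpreted call such as (1 + x | group).
struct FunctionTerm{F} <: AbstractTerm
    f::F
    args::Vector{AbstractTerm}
end

# Hypothetical random-effects term that later stages could dispatch on.
struct RanefTerm <: AbstractTerm
    lhs::AbstractTerm
    grouping::AbstractTerm
end

# Stand-in for the model type that acts as the context.
abstract type MixedModel end

# Default: a context that knows nothing about a term leaves it untouched.
apply_context(t::AbstractTerm, ::Type) = t

# Context-specific rewrite. The method involves a type owned by the package
# that defines it (the context), so it is not type piracy.
apply_context(t::FunctionTerm{typeof(|)}, ::Type{<:MixedModel}) =
    RanefTerm(t.args[1], t.args[2])

t = FunctionTerm(|, AbstractTerm[Term(:x), Term(:group)])
apply_context(t, MixedModel)  # rewritten to a RanefTerm
apply_context(t, Any)         # any other context: still the FunctionTerm
```

The point is that the rewrite happens before any schema exists, so schema/concrete_term would only ever see RanefTerm and could skip the contrasts for the grouping variable.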

cc. @dmbates, @palday

Sep 28 '19 09:09 kleinschmidt

Another example is the nesting syntax. I have some initial work on parsing it as FunctionTerm{typeof(/)}. In the R variant a/b expands to a + a:b, but this has slightly different implications for experimental vs. blocking variables. For blocking variables, (1|a/b) expands to (1|a) + (1|a:b), where that second interaction really should be treated as creating a new blocking variable whose level names are just the concatenation of the other two instead of creating the full product of their respective contrast matrices. For experimental factors, it gives simple effects for a but no effects for b beyond the interaction term. (So it's like a*b - b).
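To make the blocking-variable case concrete, here is a small sketch in plain Julia. The helper names expand_nesting and nested_groups and the " & " separator are made up for this example; nothing here is existing StatsModels or MixedModels code:

```julia
# R-style reading of a/b: the two terms a and a:b, written as tuples of names.
expand_nesting(a::Symbol, b::Symbol) = [(a,), (a, b)]

# Blocking reading of the a:b part of (1 | a/b): one new grouping variable
# whose level labels concatenate the levels of a and b, row by row.
nested_groups(a::AbstractVector, b::AbstractVector) =
    [string(x, " & ", y) for (x, y) in zip(a, b)]

a = ["s1", "s1", "s2", "s2"]
b = ["i1", "i2", "i1", "i1"]

g = nested_groups(a, b)
unique(g)  # only the combinations that occur in the data
```

With these data the new grouping variable has three levels, one per observed combination, while the full product of the levels of a and b would have four.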

Sep 28 '19 10:09 palday

This feels like it would potentially be the right time for things relating to #116, e.g. the re-treatment of all terms as interaction terms, because the model only knows about interaction terms. That is true regardless of the schema for the data, since it is a property of the model.

OTOH, isn't it the case that a given model works only for a particular data schema anyway? I guess not. To use my NN example, (basically) all NNs have the "all terms are interaction terms" property, but a given NN for a given problem may or may not need other things defined by a schema in terms of dummy coding. And potentially some subtypes might want to do different encodings (one hot vs one cold is the obvious example).

Sep 28 '19 10:09 oxinabox

Yeah, they might be related. Then again, I think the only cases that really require this change are cases where the way you compute the schema actually depends on the context+formula combination (since currently the formula and context only meet at the apply_schema stage).

Sep 28 '19 10:09 kleinschmidt

The other thing this would allow us to support very easily is marking certain variables as categorical or continuous in the formula itself. Currently, there's no way to "go back" and extract levels for a variable that by default would be treated as continuous based on a function call that it participates in; by the time you're looking at calls, you already have the schema in hand (in the current setup).
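Something like the following toy sketch is what I have in mind. The categorical marker, schema_hints, and toy_schema are all hypothetical names, and the term types are simplified stand-ins rather than the real StatsModels ones:

```julia
# Toy stand-ins; not the StatsModels API.
abstract type AbstractTerm end

struct Term <: AbstractTerm
    sym::Symbol
end

struct FunctionTerm{F} <: AbstractTerm
    f::F
    args::Vector{AbstractTerm}
end

# Hypothetical marker function a user would write in the formula.
categorical(x) = x

# Collected before any schema exists: which variables have a forced type.
schema_hints(::AbstractTerm) = Dict{Symbol,Symbol}()
schema_hints(t::FunctionTerm{typeof(categorical)}) =
    Dict(t.args[1].sym => :categorical)

# Toy schema step: extract levels when hinted, otherwise the observed range.
toy_schema(col, hint) =
    hint === :categorical ? sort(unique(col)) : (minimum(col), maximum(col))

year = [2001, 2003, 2001]
hints = schema_hints(FunctionTerm(categorical, AbstractTerm[Term(:year)]))
toy_schema(year, get(hints, :year, :continuous))  # levels instead of a range
```

Because the hint is read off the formula before the schema is computed, the levels get extracted on the first pass and nothing has to "go back" afterwards.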

Sep 28 '19 19:09 kleinschmidt
