re.form: query about documentation
The Technical Note in https://easystats.github.io/modelbased/articles/mixed_models.html#generalized-linear-mixed-models says:
For backend = "marginaleffects", the re.form argument is set to NULL for mixed models by default, to calculate marginal predictions. You can use for instance re.form = NA in your estimate_means() call to change the default value (NA will produce conditional predictions).
Documentation of glmmTMB's predict function says:
re.form \code{NULL} to specify individual-level predictions; \code{~0} or \code{NA} to specify population-level predictions (i.e., setting all random effects to zero)
That seems contradictory to me, i.e. I read Tech Note as saying re.form=NA produces conditional predictions whereas glmmTMB is saying re.form=NA gives population-level predictions.
(I'm assuming that estimate_means with backend="marginaleffects" and estimate="average" calls marginaleffects::avg_predictions)
Hi, thanks for this feedback! (maybe it's at the moment more a discussion than an "issue", so we might move it to discussions for now, and then convert back to an issue if a specific problem was identified)
I think the terminology is really bad, because the same words are used for different things. I'm referring to Andrew's blog post, in particular this distinction:
You'll see it throughout that post that re_formula = NA is conditional, while setting it to NULL is referred to as marginal.
The framing is like in the modelbased-vignette you linked to:
Conditional effect = the effect of a variable in an average cluster (i.e., group-specific, subject-specific or cluster-specific effect, or an average or a typical cluster)
Marginal effect = effect of a variable across clusters on average (i.e., global/population-level effect, or clusters on average).
Setting re.form = NA ignores the random effects, so in terms of predict() the prediction is not group-specific but on the "population level". This means you are "conditioning" on a "typical" cluster, the one you would get on average when "drawing from the population".
Setting re.form = NULL in predict() lets you take "subject"- (or group-) specific random effects into account, so you get "individual-specific" predictions. However, when you aggregate and average over them (which is what estimate_means() then does), you get an average effect across all clusters, i.e. you "marginalize" over all random effects, which makes the predictions "marginal".
My impression is that the terminology is consistent between the glmmTMB docs and the modelbased vignette, but the glmmTMB docs refer to predictions per observation, while estimate_means() gives you the averaged predictions, which then refer to "groups" or "samples".
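To make the distinction concrete, here is a minimal numerical sketch in plain Python. It does not call glmmTMB or modelbased; the intercept and the cluster deviations are hypothetical numbers, and the inverse logit link is only an assumed example of a non-linear link. It mimics what re.form = NA and re.form = NULL do at the prediction step, and what averaging the NULL-style predictions afterwards amounts to.

```python
import math

def inv_logit(eta):
    """Inverse logit link: maps the linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical values, not taken from any fitted model.
fixed_intercept = 1.5                             # linear predictor for a "typical" cluster
random_intercepts = [-2.0, -1.0, 0.0, 1.0, 2.0]   # per-cluster deviations (they sum to zero)

# Analogue of re.form = NA: random effects set to zero, one population-level prediction.
pred_na = inv_logit(fixed_intercept)

# Analogue of re.form = NULL: one prediction per cluster, including its random intercept.
pred_null = [inv_logit(fixed_intercept + u) for u in random_intercepts]

# Averaging the cluster-specific predictions marginalizes over the clusters.
pred_marginal = sum(pred_null) / len(pred_null)

print(f"re.form = NA analogue:           {pred_na:.3f}")
print(f"mean of re.form = NULL analogue: {pred_marginal:.3f}")
```

With these made-up numbers the averaged cluster-specific predictions come out lower than the single "typical cluster" prediction, even though the deviations average to zero on the link scale.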
I think @tjmahr is one of the few people in the world who is not confused by this mixed up terminology ;-)
(that said, I might be wrong here with my understanding of this topic)
Any prediction you get (NA or NULL) is a conditional effect/mean/prediction because it is the estimated value for a given observation. When we have a linear model, the NA conditional mean is also the marginal mean.
NA: typical value + 0
NULL: typical value + random intercept for each group
population: Normal(typical value, random effect variance)
Because the mean of the population is the typical value, the NA conditional mean is also the marginal mean.
In a generalized model, there is a function that transforms from the linear model scale to the outcome scale so
NA: f(typical value + 0)
NULL: f(typical value + random intercept for each group)
population: f(Normal(typical value, random effect variance))
The mean of the transformed normal distribution may not equal the NA conditional mean so we can’t call any of these a marginal mean.
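As a rough check of this argument, the sketch below approximates the population mean of f(Normal(typical value, random effect variance)) by simple midpoint-rule integration, once for an identity link and once for an inverse logit link. The typical value, the random-effect standard deviation, and the helper name population_mean are all hypothetical; nothing here comes from a fitted model or from any package's API.

```python
import math

def inv_logit(eta):
    """Inverse logit link."""
    return 1.0 / (1.0 + math.exp(-eta))

def normal_pdf(x, mean, sd):
    """Density of Normal(mean, sd) at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def population_mean(link_inverse, typical, sd, n_steps=4000, width=8.0):
    """Approximate E[link_inverse(X)] for X ~ Normal(typical, sd) with the midpoint rule."""
    lower = typical - width * sd
    step = 2.0 * width * sd / n_steps
    total = 0.0
    for i in range(n_steps):
        x = lower + (i + 0.5) * step
        total += link_inverse(x) * normal_pdf(x, typical, sd) * step
    return total

# Hypothetical typical value (link scale) and random-effect standard deviation.
typical, sd = 1.5, 1.0

# Identity link (linear model): the NA-style conditional mean equals the population mean.
identity_conditional = typical
identity_marginal = population_mean(lambda x: x, typical, sd)

# Logit link (generalized model): the two differ.
logit_conditional = inv_logit(typical)
logit_marginal = population_mean(inv_logit, typical, sd)

print(f"identity: conditional {identity_conditional:.4f}, marginal {identity_marginal:.4f}")
print(f"logit:    conditional {logit_conditional:.4f}, marginal {logit_marginal:.4f}")
```

Under these assumptions the identity-link pair agrees, while the logit-link population mean falls below f(typical value), which is the gap described above.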
To get the marginal mean on the outcome scale in the general case, you need to average over (marginalize) the f(Normal()) distribution somehow. Is that what estimate_means() does?
Sorry for formatting. Bashed this out on my phone.
Thanks for the clarification! That's great!
is that what estimate_means() does?
By default, it produces emm's as returned by emmeans() (i.e. avg_predictions() is called with a specific datagrid). If you set estimate = "average", estimate_means() works like default avg_predictions(), marginalizing over the random effects.
Thanks very much for your detailed replies (and to Andrew Heiss for his excellent blog which I had read before but clearly haven't mastered).
My learning now moves on to what I see in https://marginaleffects.com/man/r/plot_predictions.html:
The condition argument is used to plot conditional predictions, that is, predictions made on a user-specified grid. This is analogous to using the newdata argument and datagrid() function in a predictions() call. All variables whose values are not specified explicitly are treated as usual by datagrid(), that is, they are held at their mean or mode (or rounded mean for integers). This includes grouping variables in mixed-effects models,...
To get a prediction for a specific grouping variable, I think I should pass in re.form=NA - but to be honest that gets me back to my initial confusion as I still read the glmmTMB guidance as suggesting NULL would be the correct choice.