margins
margins copied to clipboard
`at` and factors
Please specify whether your issue is about:
- [ ] a possible bug
- [X] a question about package functionality
- [ ] a suggested code or documentation change, improvement to the code, or feature request
What is the proper way to specify factor values in the at list?
library(margins)
dat <- mtcars
dat$cyl <- as.factor(dat$cyl)
mod <- lm(mpg ~ cyl + hp, dat)
margins(mod, at = list(hp = 100, cyl = '6'))
Yea, this has been a problem at times. I guess it needs to be character. We should add:
- [ ] Documentation of correct behavior
- [ ] Some tests (incl. when levels are numeric values as character, as in the
mtcarsexample)
Good to know. I'll try to take a look at this if/when I ever complete the plotting project.
Would you be interested in adding a bunch of input checks with something like this: https://github.com/mllg/checkmate
The levels are also being written as numbers rather than their labels:
m <- lm(Sepal.Length ~ Sepal.Width * Species, data = iris)
> margins(m, iris, at = list(Species = c("setosa", "versicolor")))
Average marginal effects at specified values
lm(formula = Sepal.Length ~ Sepal.Width * Species, data = iris)
at(Species) Sepal.Width Speciesversicolor
setosa 0.6905 1.435
versicolor 0.8651 1.435
> summary(.Last.value)
factor Species AME SE z p lower upper
Sepal.Width 1.0000 0.6905 0.1657 4.1664 0.0000 0.3657 1.0153
Sepal.Width 2.0000 0.8651 0.2002 4.3212 0.0000 0.4727 1.2575
Speciesversicolor 1.0000 1.4345 0.1217 11.7846 0.0000 1.1959 1.6731
Speciesversicolor 2.0000 1.4345 0.1217 11.7846 0.0000 1.1959 1.6731
which is unintuitive.
The problem is due to marginal effect(s) for a factor variable are determined partially by a way that contrasts are constructed. For example: if you estimated a model as in the example at the beginning of this thread, '4' is - let's say - internally set as a base level and this determines a set (and interpretation) of model coefficients: you have a coefficient for '6' and a coefficient for '8' describing a difference comparing to '4'. Now comes important question: what do you want to achieve setting cyl='6' in at? With regards to all the other predictors it's clear, but when it comes to cyl I may imagine two options:
- If you want to treat '6' as a reference level, that means you need to change a set of model coefficients. In such a case you want to have model coefficients (and respective marginal effects) for '4' and '8' (relative to '6'). Consequently you should re-parameterize model you use (with the simplest but not very efficient method of doing this being to estimate model on a data in which reference level of
cylwas changed to '6' withrelevel()) and compute marginal effects using that model, not the original one. - If you're not interested in
cylitself (but only in effect setting it to '6' have on marginal effects of other predictors) you may accept setting marginal effects related tocylto NAs.
First problem that must be solved is that build_datalist() with specified at regarding factor variable returns this variable as a factor with only one level - this on which at points out. As a consequence marginal_effects() simply can't return results in a form that will match model coefficients.
And perhaps dxdy.factor should get levels not from data but from (data that has been used in) a model?
I mean, if user don't specify neither data nor at argument while calling margins() these are the same data, however otherwise (with factors being coded differently - either in terms of reference level or in terms of type of contrasts) there can occur differences in set of columns in model matrices derived from this data and from a model (with model coefficients being corresponding to columns of model matrix).
Replacing code at the beginning of dydx.factor() with this below should work.
levs <- sort(unique(find_data(model)[[variable]]))
if (variable %in% names(attributes(data)$at)) {
base <- factor(attributes(data)$at[[variable]], levs)
} else {
base <- levs[1L]
}
levs <- levs[levs != base]
This should work even with current behavior of build_datalist().
There is one important limitation, however: if user specifies more than one value of a factor variable in at (like: margins(mod, at = list(hp = 100, cyl = c('6', '8')))) margins() will fail at the very end of its work trying to rbind() out because different elements in out will have different sets of columns.