insight
`get_data()`: labels with `factor()` in model formula
Can `get_data()` preserve the label of a variable when it is wrapped in `factor()` in the formula?
Notice that the `mpg` variable retains its label, but the `cyl` factor does not, since the latter is wrapped in `factor()` in the model formula.
```r
library(haven)
library(insight)
dat <- mtcars
dat$mpg <- labelled(dat$mpg, label = "Miles per Gallon")
dat$cyl <- labelled(dat$cyl, label = "Cylinders")
mod <- lm(mpg ~ factor(cyl), dat)
get_data(mod) |> str()
#> 'data.frame': 32 obs. of 2 variables:
#> $ mpg: dbl+lbl [1:32] 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 1...
#> ..@ label: chr "Miles per Gallon"
#> $ cyl: num 6 6 4 6 8 6 8 4 4 6 ...
#> ..- attr(*, "factor")= logi TRUE
#> - attr(*, "terms")=Classes 'terms', 'formula' language mpg ~ factor(cyl)
#> .. ..- attr(*, "variables")= language list(mpg, factor(cyl))
#> .. ..- attr(*, "factors")= int [1:2, 1] 0 1
#> .. .. ..- attr(*, "dimnames")=List of 2
#> .. .. .. ..$ : chr [1:2] "mpg" "factor(cyl)"
#> .. .. .. ..$ : chr "factor(cyl)"
#> .. ..- attr(*, "term.labels")= chr "factor(cyl)"
#> .. ..- attr(*, "order")= int 1
#> .. ..- attr(*, "intercept")= int 1
#> .. ..- attr(*, "response")= int 1
#> .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
#> .. ..- attr(*, "predvars")= language list(mpg, factor(cyl))
#> .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "factor"
#> .. .. ..- attr(*, "names")= chr [1:2] "mpg" "factor(cyl)"
#> - attr(*, "factors")= chr "cyl"
#> - attr(*, "is_subset")= logi FALSE
```
No, the `factor()` function strips labels. I'm not exactly following the use case, though. My understanding is that the `labelled` class isn't intended to be retained after import; such variables should be converted to either numeric or factor/ordered.
This would be useful for automatically replacing variable names with their labels in {modelsummary} tables: https://vincentarelbundock.github.io/modelsummary/articles/appearance.html#variable-labels. For now, simple variables in the formula can be replaced by their labels, but not those wrapped in `factor()`.
I'm not sure about the performance, but at this place: https://github.com/easystats/insight/blob/33e54687b04ec85f8b1d0430629f2e47fea5f010/R/utils_get_data.R#L483
we could recover the data frame from the environment, match the names of variables that were coerced "on the fly", and then retrieve the label attributes from the recovered data.
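A rough base-R sketch of that idea, not the actual insight implementation: it assumes the model stores its call in `mod$call`, that the `data` argument can be evaluated in the formula's environment, and that labels live in a plain `"label"` attribute (which is where `haven::labelled()` puts them). The variable names `original`, `vars`, and `labels` are only illustrative.

```r
dat <- mtcars
# Stand-in for haven::labelled(): attach a plain "label" attribute.
attr(dat$mpg, "label") <- "Miles per Gallon"
attr(dat$cyl, "label") <- "Cylinders"

mod <- lm(mpg ~ factor(cyl), data = dat)

# Recover the original data frame from the environment of the formula.
original <- eval(mod$call$data, envir = environment(formula(mod)))

# all.vars() returns the bare variable names, so "factor(cyl)" becomes "cyl".
vars <- all.vars(formula(mod))

# Look up the label attribute of each matched variable in the recovered data.
labels <- unlist(lapply(original[vars], attr, which = "label", exact = TRUE))
print(labels)
```

The resulting named vector could then be used to re-attach labels to the columns returned by `get_data()`. Variables without a label simply drop out of `labels`.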
Another possibility (but I don't know if `expand.model.frame()` is supported by all models):
```r
mtcars$cyl <- haven::labelled(mtcars$cyl, label = "Number of cylinders")
mtcars$hp <- haven::labelled(mtcars$hp, label = "Horsepower")
mtcars$am <- haven::labelled(mtcars$am, label = "Transmission")
mod <- lm(mpg ~ hp + factor(cyl) + factor(am), data = mtcars)

# find the terms wrapped in factor() and strip the wrapper to get the variable names
fac <- insight::find_terms(mod)$conditional
fac <- fac[startsWith(fac, "factor(")]
fac <- gsub("^factor\\(", "", fac)
fac <- gsub("\\)$", "", fac)

# re-add the original, uncoerced variables to the model frame
x <- expand.model.frame(mod, fac)[, fac]
lapply(x, class)
#> $cyl
#> [1] "haven_labelled" "vctrs_vctr" "double"
#>
#> $am
#> [1] "haven_labelled" "vctrs_vctr" "double"
```
Created on 2022-08-26 by the reprex package (v2.0.1)
I have a vague memory that `expand.model.frame()` only works for a very limited set of models.
> I'm not sure about the performance
My guess is that the main performance penalty would come from copying/assigning. I wonder if it is possible to retrieve the attribute directly from the environment, by name, without calling `eval()` or re-assigning. Otherwise, I'm not sure it's worth the performance hit, since this is a relatively minor feature.
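One way this could look, as a sketch only: `label_from_env()` is a hypothetical helper, not an insight function. It assumes `data` was passed to the model as a bare name and that labels are stored in a `"label"` attribute. It uses `get()` instead of `eval()`, and since R copies objects on modification rather than on read, looking up an attribute this way should not duplicate the data frame.

```r
# Hypothetical helper (not part of insight): look up a label by name,
# without eval() and without re-assigning the data.
label_from_env <- function(model, variable) {
  data_arg <- model$call$data
  # Only handles the simple case where `data` was passed as a bare name.
  if (!is.symbol(data_arg)) {
    return(NULL)
  }
  env <- environment(stats::formula(model))
  data_name <- as.character(data_arg)
  if (!exists(data_name, envir = env)) {
    return(NULL)
  }
  # get() returns the existing object; reading an attribute does not modify it.
  attr(get(data_name, envir = env)[[variable]], "label", exact = TRUE)
}

dat <- mtcars
attr(dat$cyl, "label") <- "Cylinders"
mod <- lm(mpg ~ factor(cyl), data = dat)

label_from_env(mod, "cyl")
```

Calls where `data` is an expression (for example a subset or a pipeline) fall through to `NULL` here and would still need `eval()` or another fallback.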
Closing now, as this will be solved by #691.