performance
`check_model`: missing outlier plot when scaling data
Summary: `check_model()` fails to plot the outlier panel when the data have been scaled with `scale()`, because the scaled variables become one-column matrices rather than numeric vectors.
Reprex: The following works:
library(performance)
m <- lm(mpg ~ wt + cyl + gear + disp, data = mtcars)
check_model(m)
![](https://i.imgur.com/sDGgK7V.png)
Looks good. Let's scale the data:
library(dplyr)
mtcars2 <- mtcars %>%
  mutate(across(everything(), scale))
m2 <- lm(mpg ~ wt + cyl + gear + disp, data = mtcars2)
check_model(m2)
![](https://i.imgur.com/Bs7TBP1.png)
The outlier panel is missing. The reason is that the outlier check is failing silently.
check_model(m2, check = "outliers")
#> Error in unit(rep(0, TABLE_ROWS * dims[1]), "null"): 'x' and 'units' must have length > 0
The reason is that scaling changes the object class from numeric vector to matrix array.
class(mtcars2$mpg)
#> [1] "matrix" "array"
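To make the class change concrete, here is a minimal base R sketch (it assumes nothing beyond the built-in `mtcars` data) that inspects what `scale()` returns for a single column:

```r
x <- scale(mtcars$mpg)

is.matrix(x)          # TRUE: the result is a matrix, not a plain vector
dim(x)                # 32 rows, 1 column
names(attributes(x))  # includes "dim", "scaled:center" and "scaled:scale"

# Converting explicitly gives back an ordinary numeric vector
v <- as.numeric(x)
is.matrix(v)          # FALSE
```

The centering and scaling constants travel along as attributes, which is why the result is more than a bare vector.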
A solution is to convert the scaled result back to a plain numeric vector, with either `as.numeric()` or `as.vector()`:
mtcars3 <- mtcars %>%
  mutate(across(everything(), ~scale(.x) %>% as.numeric))
m3 <- lm(mpg ~ wt + cyl + gear + disp, data = mtcars3)
check_model(m3)
![](https://i.imgur.com/63Dm8qD.png)
mtcars4 <- mtcars %>%
  mutate(across(everything(), ~scale(.x) %>% as.vector))
m4 <- lm(mpg ~ wt + cyl + gear + disp, data = mtcars4)
check_model(m4)
![](https://i.imgur.com/PWLN96D.png)
Note that scaling through `lapply()` instead of `dplyr::mutate()` works:
mtcars5 <- lapply(mtcars, scale) |> as.data.frame()
m5 <- lm(mpg ~ wt + cyl + gear + disp, data = mtcars5)
check_model(m5)
![](https://i.imgur.com/tyvLOf2.png)
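A likely explanation for why the `lapply()` route works (this is my reading of base R behaviour, not something checked against the performance source): `lapply()` still produces one-column matrices, but `as.data.frame()` expands each of them into an ordinary numeric column. A minimal sketch:

```r
scaled_list <- lapply(mtcars, scale)

# Each list element is still a one-column matrix at this point
is.matrix(scaled_list$mpg)

# as.data.frame() turns every matrix into plain columns
mtcars5 <- as.data.frame(scaled_list)
is.matrix(mtcars5$mpg)
is.numeric(mtcars5$mpg)
```

So the conversion to vectors happens in `as.data.frame()`, not in `lapply()` itself.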
The issue also emerges if one scales just a single variable, suggesting that the issue actually lies in the base R `scale()` function.
mtcars6 <- mtcars
mtcars6$wt <- scale(mtcars$wt)
m6 <- lm(mpg ~ wt + cyl + gear + disp, data = mtcars6)
check_model(m6)
![](https://i.imgur.com/RGvQjW5.png)
Created on 2022-06-11 by the reprex package (v2.0.1)
This is confusing many students in an introductory R stats class here, because they are taught to scale their variables at the beginning of their script, and then `check_model()` fails later on. It would be nice if `check_model()` could automatically convert from matrix array to numeric vector, if applicable.
As a quick workaround, I would always recommend using a standardize function that preserves the vector class, e.g. `datawizard::standardize()`.
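For illustration, a minimal sketch of that workaround, assuming the datawizard package is installed (the object names `mtcars_std` and `m_std` are made up for this example):

```r
library(datawizard)

# standardize() on a data frame standardizes its numeric columns
mtcars_std <- standardize(mtcars)

# Columns stay plain numeric vectors, so no matrix columns end up in the data
any(vapply(mtcars_std, is.matrix, logical(1)))

# Fitting and predicting on the same data then goes through
m_std <- lm(mpg ~ wt + cyl + gear + disp, data = mtcars_std)
head(predict(m_std, newdata = mtcars_std))
```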
I'll look into this. I'm not sure exactly where this fails, because `check_outliers()` seems to work.
The error comes from `insight::get_predicted()`. For now, I added a warning. I'm not quite sure how best to fix this issue.
library(dplyr)
mtcars2 <- mtcars %>%
  mutate(across(everything(), scale))
m2 <- lm(mpg ~ wt + cyl + gear + disp, data = mtcars2)
insight::get_predicted(m2)
#> Some of the variables were in matrix-format - probably you used
#> 'scale()' on your data?
#> If so, and you get an error, please try 'datawizard::standardize()' to
#> standardize your data.
#> Error: variables 'wt', 'cyl', 'gear', 'disp' were specified with different types from the fit
Created on 2022-06-12 by the reprex package (v2.0.1)
The issue is `scale()`'s terrible behavior of always returning a matrix. Users should just never use `scale()`.
We could add a check to see if a predictor variable is a matrix and throw an error/warning like we do if a formula includes a `$`?
Thank you, I like the warning and bwiernik's suggestion to throw an error as well. Out of curiosity, would there be any downside to automatically checking whether any variable is a matrix and, if so, converting it to a vector, with a similar warning about the conversion? It doesn't seem to have been a problem for any of the other panels in `check_model()`.
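To make the check-and-convert idea concrete, here is a rough base R sketch. The helper `flatten_matrix_cols()` is hypothetical (it is not part of performance or insight) and is meant to be applied to the data before the model is fitted:

```r
# Hypothetical helper, not part of performance or insight:
# converts one-column matrix columns to plain vectors, with a warning.
flatten_matrix_cols <- function(data) {
  for (nm in names(data)) {
    col <- data[[nm]]
    if (is.matrix(col) && ncol(col) == 1L) {
      warning("Column '", nm, "' is a one-column matrix; converting to a vector.",
              call. = FALSE)
      data[[nm]] <- as.vector(col)
    }
  }
  data
}

d <- mtcars
d$wt <- scale(d$wt)                       # one-column matrix, as in the report
d_fixed <- suppressWarnings(flatten_matrix_cols(d))

m <- lm(mpg ~ wt + cyl + gear + disp, data = d_fixed)
head(predict(m, newdata = d_fixed))
```

One trade-off of this sketch is that `as.vector()` drops the `scaled:center` and `scaled:scale` attributes, so the information needed to back-transform is lost.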
> We could add a check to see if a predictor variable is a matrix and throw an error/warning like we do if a formula includes a `$`?
The problem is that if `get_predicted()` is called without a `data` argument, `get_data()` is called, which coerces matrix columns into vectors. `scale()` causes no problem when called on the fly in the formula. If it's called before fitting the model, then the variable names in the data are the same as the original variable names, but the variable types are 1D-matrices. `get_data()` returns a data frame where the variable names are also the same as in the original data, but the data types are coerced into numeric. But `predict()` expects the same type, probably because the names are identical?
At this point, it's difficult to check the original input type. I try to read the `dataClasses` attribute of `terms`, but not all model types have a `terms()` method: https://github.com/easystats/insight/commit/216d735a860448d3e365dd457ab60f03c40dd82c
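As a small base R illustration of what the `dataClasses` attribute records, here is a sketch for an `lm` fit. It mimics the coercion with `as.numeric()` instead of calling `get_data()`, so the `newdata` object is an assumption made for this example rather than insight's actual output:

```r
d <- mtcars
d$wt <- scale(d$wt)                  # one-column matrix, variable name unchanged
m <- lm(mpg ~ wt + cyl, data = d)

# Classes stored at fit time: wt is recorded as a one-column numeric matrix
attr(terms(m), "dataClasses")

# Mimic the coercion to a plain numeric vector (assumption for this sketch)
newdata <- d
newdata$wt <- as.numeric(newdata$wt)

# Classes predict() would see for the new data
vapply(newdata[c("wt", "cyl")], .MFclass, character(1))

# The mismatch for 'wt' makes predict() stop with a type error
res <- tryCatch(
  predict(m, newdata = newdata),
  error = function(e) conditionMessage(e)
)
res
```

`.MFclass()` is the stats helper that assigns these class labels, which is why the sketch uses it for the comparison.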
See the example below, which makes what I described above a bit clearer.
library(insight)
library(dplyr)
mtcars2 <- mtcars %>%
  mutate(across(everything(), scale))
m1 <- lm(scale(mpg) ~ scale(wt) + scale(cyl) + scale(gear) + scale(disp), data = mtcars)
# model frame contains scaled variables, including column names with "scale()"
model.frame(m1) |> str()
#> 'data.frame': 32 obs. of 5 variables:
#> $ scale(mpg) : num [1:32, 1] 0.151 0.151 0.45 0.217 -0.231 ...
#> ..- attr(*, "scaled:center")= num 20.1
#> ..- attr(*, "scaled:scale")= num 6.03
#> $ scale(wt) : num [1:32, 1] -0.6104 -0.3498 -0.917 -0.0023 0.2277 ...
#> ..- attr(*, "scaled:center")= num 3.22
#> ..- attr(*, "scaled:scale")= num 0.978
#> $ scale(cyl) : num [1:32, 1] -0.105 -0.105 -1.225 -0.105 1.015 ...
#> ..- attr(*, "scaled:center")= num 6.19
#> ..- attr(*, "scaled:scale")= num 1.79
#> $ scale(gear): num [1:32, 1] 0.424 0.424 0.424 -0.932 -0.932 ...
#> ..- attr(*, "scaled:center")= num 3.69
#> ..- attr(*, "scaled:scale")= num 0.738
#> $ scale(disp): num [1:32, 1] -0.571 -0.571 -0.99 0.22 1.043 ...
#> ..- attr(*, "scaled:center")= num 231
#> ..- attr(*, "scaled:scale")= num 124
#> ...
# get_data returns original data
get_data(m1) |> str()
#> 'data.frame': 32 obs. of 5 variables:
#> $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
#> $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
#> $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
#> $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
#> $ disp: num 160 160 108 258 360 ...
#> ...
# get_predicted and predict work
get_predicted(m1)
#> Some of the variables were in matrix-format - probably you used
#> 'scale()' on your data?
#> If so, and you get an error, please try 'datawizard::standardize()' to
#> standardize your data.
#> Predicted values:
#>
#> [1] 0.32445260 0.16397650 1.04543816 0.14430212 -0.47187319 -0.04790433
#> [7] -0.55368454 0.54252269 0.56089726 -0.18283127 -0.18283127 -0.96536116
#> [13] -0.75139302 -0.78285892 -1.48188932 -1.60521739 -1.57854583 1.08719605
#> [19] 1.45189042 1.30814020 1.04950441 -0.57061221 -0.53325137 -0.73512269
#> [25] -0.68065788 1.25431100 1.09151240 1.45705867 -0.47507819 0.13139606
#> [31] -0.78441681 0.77093082
#>
#> NOTE: Confidence intervals, if available, are stored as attributes and can be accessed using `as.data.frame()` on this output.
predict(m1, newdata = get_data(m1))
#> Mazda RX4 Mazda RX4 Wag Datsun 710 Hornet 4 Drive
#> 0.32445260 0.16397650 1.04543816 0.14430212
#> Hornet Sportabout Valiant Duster 360 Merc 240D
#> -0.47187319 -0.04790433 -0.55368454 0.54252269
#> Merc 230 Merc 280 Merc 280C Merc 450SE
#> 0.56089726 -0.18283127 -0.18283127 -0.96536116
#> Merc 450SL Merc 450SLC Cadillac Fleetwood Lincoln Continental
#> -0.75139302 -0.78285892 -1.48188932 -1.60521739
#> Chrysler Imperial Fiat 128 Honda Civic Toyota Corolla
#> -1.57854583 1.08719605 1.45189042 1.30814020
#> Toyota Corona Dodge Challenger AMC Javelin Camaro Z28
#> 1.04950441 -0.57061221 -0.53325137 -0.73512269
#> Pontiac Firebird Fiat X1-9 Porsche 914-2 Lotus Europa
#> -0.68065788 1.25431100 1.09151240 1.45705867
#> Ford Pantera L Ferrari Dino Maserati Bora Volvo 142E
#> -0.47507819 0.13139606 -0.78441681 0.77093082
m2 <- lm(mpg ~ wt + cyl + gear + disp, data = mtcars2)
# model frame contains scaled variables, with variable names of original data
model.frame(m2) |> str()
#> 'data.frame': 32 obs. of 5 variables:
#> $ mpg : num [1:32, 1] 0.151 0.151 0.45 0.217 -0.231 ...
#> ..- attr(*, "scaled:center")= num 20.1
#> ..- attr(*, "scaled:scale")= num 6.03
#> $ wt : num [1:32, 1] -0.6104 -0.3498 -0.917 -0.0023 0.2277 ...
#> ..- attr(*, "scaled:center")= num 3.22
#> ..- attr(*, "scaled:scale")= num 0.978
#> $ cyl : num [1:32, 1] -0.105 -0.105 -1.225 -0.105 1.015 ...
#> ..- attr(*, "scaled:center")= num 6.19
#> ..- attr(*, "scaled:scale")= num 1.79
#> $ gear: num [1:32, 1] 0.424 0.424 0.424 -0.932 -0.932 ...
#> ..- attr(*, "scaled:center")= num 3.69
#> ..- attr(*, "scaled:scale")= num 0.738
#> $ disp: num [1:32, 1] -0.571 -0.571 -0.99 0.22 1.043 ...
#> ..- attr(*, "scaled:center")= num 231
#> ..- attr(*, "scaled:scale")= num 124
#> ...
# get_data returns data that was used to fit model (i.e. scaled variables),
# but coerces 1D-matrix to numeric vector
get_data(m2) |> str()
#> 'data.frame': 32 obs. of 5 variables:
#> $ mpg : num 0.151 0.151 0.45 0.217 -0.231 ...
#> $ wt : num -0.6104 -0.3498 -0.917 -0.0023 0.2277 ...
#> $ cyl : num -0.105 -0.105 -1.225 -0.105 1.015 ...
#> $ gear: num 0.424 0.424 0.424 -0.932 -0.932 ...
#> $ disp: num -0.571 -0.571 -0.99 0.22 1.043 ...
#> ...
# fails
get_predicted(m2)
#> Some of the variables were in matrix-format - probably you used
#> 'scale()' on your data?
#> If so, and you get an error, please try 'datawizard::standardize()' to
#> standardize your data.
#> Error: variables 'wt', 'cyl', 'gear', 'disp' were specified with different types from the fit
predict(m2, newdata = get_data(m2))
#> Error: variables 'wt', 'cyl', 'gear', 'disp' were specified with different types from the fit
Created on 2022-06-12 by the reprex package (v2.0.1)