performance `check_model`: missing outlier plot when scaling data

Open rempsyc opened this issue 2 years ago • 5 comments

Summary: check_model fails to plot the outlier panel when scaling data because the scaled variables become incompatible matrix arrays.

Reprex: The following works:

library(performance)
m <- lm(mpg ~ wt + cyl + gear + disp, data = mtcars)
check_model(m)

Looks good. Now let's scale the data:

library(dplyr)
mtcars2 <- mtcars %>%
  mutate(across(everything(), scale))

m2 <- lm(mpg ~ wt + cyl + gear + disp, data = mtcars2)
check_model(m2)

The outlier panel is missing. The reason is that the outlier check is failing silently.

check_model(m2, check = "outliers")
#> Error in unit(rep(0, TABLE_ROWS * dims[1]), "null"): 'x' and 'units' must have length > 0

The reason is that scaling changes the object class from numeric vector to matrix array.

class(mtcars2$mpg)
#> [1] "matrix" "array"

The solution is to convert the scaled columns back to plain vectors with as.numeric or as.vector:

mtcars3 <- mtcars %>%
  mutate(across(everything(), ~scale(.x) %>% as.numeric))

m3 <- lm(mpg ~ wt + cyl + gear + disp, data = mtcars3)
check_model(m3)

mtcars4 <- mtcars %>%
  mutate(across(everything(), ~scale(.x) %>% as.vector))

m4 <- lm(mpg ~ wt + cyl + gear + disp, data = mtcars4)
check_model(m4)
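
For comparison, the same standardization can be done without scale() at all. This is only a minimal sketch: `z_score()` is a hypothetical helper written for illustration, not a function from base R or performance.

```r
# Hypothetical helper: standardize a numeric vector without going through
# scale(), so the result stays a plain numeric vector (no "dim" attribute).
z_score <- function(x) {
  (x - mean(x, na.rm = TRUE)) / stats::sd(x, na.rm = TRUE)
}

mtcars_z <- mtcars
mtcars_z[] <- lapply(mtcars, z_score)

class(mtcars_z$mpg)             # plain numeric vector
is.matrix(scale(mtcars$mpg))    # scale() returns a one-column matrix

# The values match what scale() computes; only the container differs
all.equal(mtcars_z$mpg, as.numeric(scale(mtcars$mpg)))
```

Because the columns never become matrices, a model fitted on `mtcars_z` should behave like m3 and m4 above.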

Note that scaling through lapply instead of dplyr::mutate works:

mtcars5 <- lapply(mtcars, scale) |> as.data.frame()

m5 <- lm(mpg ~ wt + cyl + gear + disp, data = mtcars5)
check_model(m5)

The issue also emerges when only a single variable is scaled, which suggests that it originates in the base R scale function rather than in dplyr.

mtcars6 <- mtcars
mtcars6$wt <- scale(mtcars$wt)

m6 <- lm(mpg ~ wt + cyl + gear + disp, data = mtcars6)
check_model(m6)

Created on 2022-06-11 by the reprex package (v2.0.1)

This is confusing many students in an introductory R stats class here because they are taught to scale their variables at the beginning of their script, but then the following fails. It would be nice if check_model could automatically convert from matrix array to numeric vector, if applicable.
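
To make the requested conversion concrete, here is a rough sketch of what it could look like on the user's side. `flatten_matrix_columns()` is a hypothetical helper, not something performance or insight provides, and it only handles one-column matrices.

```r
# Hypothetical helper (not part of performance or insight): convert every
# one-column matrix in a data frame back to a plain vector.
flatten_matrix_columns <- function(data) {
  for (nm in names(data)) {
    col <- data[[nm]]
    if (is.matrix(col) && ncol(col) == 1L) {
      # as.vector() drops "dim" and the "scaled:*" attributes set by scale()
      data[[nm]] <- as.vector(col)
    }
  }
  data
}

d <- mtcars
d$wt <- scale(d$wt)            # one-column matrix, as in the reprex
d_flat <- flatten_matrix_columns(d)

is.matrix(d$wt)                # TRUE
is.matrix(d_flat$wt)           # FALSE
```

Running this on the data before fitting has the same effect as the as.numeric and as.vector workarounds, without having to change the scaling step itself.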

Jun 11 '22 19:06 rempsyc

As a quick workaround, I would recommend using a standardize function that preserves the vector class, e.g. datawizard::standardize().

I'll look into this. I'm not sure yet where exactly it fails, because check_outliers() seems to work.
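
A short usage sketch of that workaround, assuming the datawizard package is installed (the block is skipped otherwise). The object names `mtcars_std` and `m_std` are just illustrative.

```r
# Assumes datawizard is installed; the whole block is skipped if it is not.
if (requireNamespace("datawizard", quietly = TRUE)) {
  # standardize() on a data frame standardizes the numeric columns
  mtcars_std <- datawizard::standardize(mtcars)

  # Columns have no "dim" attribute, unlike the output of scale()
  print(is.null(dim(mtcars_std$mpg)))

  m_std <- lm(mpg ~ wt + cyl + gear + disp, data = mtcars_std)

  # Types recorded at fit time: no matrix-typed variables
  print(attr(stats::terms(m_std), "dataClasses"))
}
```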

Jun 11 '22 20:06 strengejacke

The error comes from insight::get_predicted(). For now, I added a warning. Not quite sure how to best fix this issue.

library(dplyr)
mtcars2 <- mtcars %>%
  mutate(across(everything(), scale))

m2 <- lm(mpg ~ wt + cyl + gear + disp, data = mtcars2)
insight::get_predicted(m2)
#> Some of the variables were in matrix-format - probably you used
#>   'scale()' on your data?
#>   If so, and you get an error, please try 'datawizard::standardize()' to
#>   standardize your data.
#> Error: variables 'wt', 'cyl', 'gear', 'disp' were specified with different types from the fit

Created on 2022-06-12 by the reprex package (v2.0.1)

Jun 12 '22 18:06 strengejacke

The issue is scale()'s terrible behavior of always returning a matrix. Users should just never use scale().

We could add a check to see if a predictor variable is a matrix and throw an error/warning like we do if a formula includes a $?
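
A rough sketch of such a check, for illustration only. `warn_matrix_columns()` is a hypothetical function, not the actual performance or insight implementation, and it assumes the model has a working model.frame() method.

```r
# Hypothetical check (not the actual performance/insight implementation):
# warn when any column of the model frame is stored as a matrix.
warn_matrix_columns <- function(model) {
  mf <- stats::model.frame(model)
  is_mat <- vapply(mf, is.matrix, logical(1))
  if (any(is_mat)) {
    warning(
      "Matrix columns in model frame: ",
      paste(names(mf)[is_mat], collapse = ", "),
      ". Was scale() used on the data?",
      call. = FALSE
    )
  }
  # Return the offending column names invisibly
  invisible(names(mf)[is_mat])
}

d <- mtcars
d$wt <- scale(d$wt)
m <- lm(mpg ~ wt + cyl, data = d)
warn_matrix_columns(m)    # warns about 'wt'
```

A check this simple would also flag terms such as scale(wt) written directly in the formula, since those appear as matrix columns in the model frame too, so a real implementation would probably need to be more selective.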

Jun 12 '22 19:06 bwiernik

Thank you, I like the warning, and also bwiernik's suggestion to throw an error. Out of curiosity, would there be any downside to automatically checking whether any variable is a matrix and, if so, converting it to a vector, with a similar warning about the conversion? It doesn't seem to have been a problem for any of the other panels in check_model.

Jun 12 '22 19:06 rempsyc

> We could add a check to see if a predictor variable is a matrix and throw an error/warning like we do if a formula includes a $?

The problem is that if get_predicted() is called without a data argument, get_data() is called, which coerces matrix columns into vectors. scale() causes no problem when called on the fly in the formula. If it's called before fitting the model, the variable names in the data are the same as the original variable names, but the variable types are one-column matrices. get_data() returns a data frame where the variable names are also the same as in the original data, but the data types are coerced to numeric. predict(), however, expects the same types as in the fit, probably because the names are identical?

At this point, it's difficult to check the original input type. I tried reading the dataClasses attribute of terms, but not all model types have a terms() method: https://github.com/easystats/insight/commit/216d735a860448d3e365dd457ab60f03c40dd82c
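
For an lm fit, the dataClasses lookup can be sketched as follows. This is a minimal illustration using base R only, not the code from the linked commit, and it uses a smaller model than the reprex.

```r
d <- mtcars
d$wt <- scale(d$wt)                  # matrix column, as in the reprex
m <- lm(mpg ~ wt + cyl, data = d)

# Types recorded at fit time; a one-column numeric matrix is "nmatrix.1"
data_classes <- attr(stats::terms(m), "dataClasses")
data_classes

# Variables that were stored as matrices when the model was fitted
names(data_classes)[grepl("^nmatrix", data_classes)]
```

These recorded types are what predict() compares new data against, which is why a plain numeric wt is rejected later on.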

The example below makes what I described above a bit clearer.

library(insight)
library(dplyr)
mtcars2 <- mtcars %>%
  mutate(across(everything(), scale))

m1 <- lm(scale(mpg) ~ scale(wt) + scale(cyl) + scale(gear) + scale(disp), data = mtcars)

# model frame contains scaled variables, including column names with "scale()"
model.frame(m1) |> str()
#> 'data.frame':    32 obs. of  5 variables:
#>  $ scale(mpg) : num [1:32, 1] 0.151 0.151 0.45 0.217 -0.231 ...
#>   ..- attr(*, "scaled:center")= num 20.1
#>   ..- attr(*, "scaled:scale")= num 6.03
#>  $ scale(wt)  : num [1:32, 1] -0.6104 -0.3498 -0.917 -0.0023 0.2277 ...
#>   ..- attr(*, "scaled:center")= num 3.22
#>   ..- attr(*, "scaled:scale")= num 0.978
#>  $ scale(cyl) : num [1:32, 1] -0.105 -0.105 -1.225 -0.105 1.015 ...
#>   ..- attr(*, "scaled:center")= num 6.19
#>   ..- attr(*, "scaled:scale")= num 1.79
#>  $ scale(gear): num [1:32, 1] 0.424 0.424 0.424 -0.932 -0.932 ...
#>   ..- attr(*, "scaled:center")= num 3.69
#>   ..- attr(*, "scaled:scale")= num 0.738
#>  $ scale(disp): num [1:32, 1] -0.571 -0.571 -0.99 0.22 1.043 ...
#>   ..- attr(*, "scaled:center")= num 231
#>   ..- attr(*, "scaled:scale")= num 124
#> ...

# get_data returns original data
get_data(m1) |> str()
#> 'data.frame':    32 obs. of  5 variables:
#>  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
#>  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
#>  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
#>  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
#>  $ disp: num  160 160 108 258 360 ...
#> ...

# get_predicted and predict work
get_predicted(m1)
#> Some of the variables were in matrix-format - probably you used
#>   'scale()' on your data?
#>   If so, and you get an error, please try 'datawizard::standardize()' to
#>   standardize your data.
#> Predicted values:
#> 
#>  [1]  0.32445260  0.16397650  1.04543816  0.14430212 -0.47187319 -0.04790433
#>  [7] -0.55368454  0.54252269  0.56089726 -0.18283127 -0.18283127 -0.96536116
#> [13] -0.75139302 -0.78285892 -1.48188932 -1.60521739 -1.57854583  1.08719605
#> [19]  1.45189042  1.30814020  1.04950441 -0.57061221 -0.53325137 -0.73512269
#> [25] -0.68065788  1.25431100  1.09151240  1.45705867 -0.47507819  0.13139606
#> [31] -0.78441681  0.77093082
#> 
#> NOTE: Confidence intervals, if available, are stored as attributes and can be accessed using `as.data.frame()` on this output.
predict(m1, newdata = get_data(m1))
#>           Mazda RX4       Mazda RX4 Wag          Datsun 710      Hornet 4 Drive 
#>          0.32445260          0.16397650          1.04543816          0.14430212 
#>   Hornet Sportabout             Valiant          Duster 360           Merc 240D 
#>         -0.47187319         -0.04790433         -0.55368454          0.54252269 
#>            Merc 230            Merc 280           Merc 280C          Merc 450SE 
#>          0.56089726         -0.18283127         -0.18283127         -0.96536116 
#>          Merc 450SL         Merc 450SLC  Cadillac Fleetwood Lincoln Continental 
#>         -0.75139302         -0.78285892         -1.48188932         -1.60521739 
#>   Chrysler Imperial            Fiat 128         Honda Civic      Toyota Corolla 
#>         -1.57854583          1.08719605          1.45189042          1.30814020 
#>       Toyota Corona    Dodge Challenger         AMC Javelin          Camaro Z28 
#>          1.04950441         -0.57061221         -0.53325137         -0.73512269 
#>    Pontiac Firebird           Fiat X1-9       Porsche 914-2        Lotus Europa 
#>         -0.68065788          1.25431100          1.09151240          1.45705867 
#>      Ford Pantera L        Ferrari Dino       Maserati Bora          Volvo 142E 
#>         -0.47507819          0.13139606         -0.78441681          0.77093082


m2 <- lm(mpg ~ wt + cyl + gear + disp, data = mtcars2)

# model frame contains scaled variables, with variable names of original data
model.frame(m2) |> str()
#> 'data.frame':    32 obs. of  5 variables:
#>  $ mpg : num [1:32, 1] 0.151 0.151 0.45 0.217 -0.231 ...
#>   ..- attr(*, "scaled:center")= num 20.1
#>   ..- attr(*, "scaled:scale")= num 6.03
#>  $ wt  : num [1:32, 1] -0.6104 -0.3498 -0.917 -0.0023 0.2277 ...
#>   ..- attr(*, "scaled:center")= num 3.22
#>   ..- attr(*, "scaled:scale")= num 0.978
#>  $ cyl : num [1:32, 1] -0.105 -0.105 -1.225 -0.105 1.015 ...
#>   ..- attr(*, "scaled:center")= num 6.19
#>   ..- attr(*, "scaled:scale")= num 1.79
#>  $ gear: num [1:32, 1] 0.424 0.424 0.424 -0.932 -0.932 ...
#>   ..- attr(*, "scaled:center")= num 3.69
#>   ..- attr(*, "scaled:scale")= num 0.738
#>  $ disp: num [1:32, 1] -0.571 -0.571 -0.99 0.22 1.043 ...
#>   ..- attr(*, "scaled:center")= num 231
#>   ..- attr(*, "scaled:scale")= num 124
#> ...

# get_data returns data that was used to fit model (i.e. scaled variables),
# but coerces 1D-matrix to numeric vector
get_data(m2) |> str()
#> 'data.frame':    32 obs. of  5 variables:
#>  $ mpg : num  0.151 0.151 0.45 0.217 -0.231 ...
#>  $ wt  : num  -0.6104 -0.3498 -0.917 -0.0023 0.2277 ...
#>  $ cyl : num  -0.105 -0.105 -1.225 -0.105 1.015 ...
#>  $ gear: num  0.424 0.424 0.424 -0.932 -0.932 ...
#>  $ disp: num  -0.571 -0.571 -0.99 0.22 1.043 ...
#> ...

# fails
get_predicted(m2)
#> Some of the variables were in matrix-format - probably you used
#>   'scale()' on your data?
#>   If so, and you get an error, please try 'datawizard::standardize()' to
#>   standardize your data.
#> Error: variables 'wt', 'cyl', 'gear', 'disp' were specified with different types from the fit
predict(m2, newdata = get_data(m2))
#> Error: variables 'wt', 'cyl', 'gear', 'disp' were specified with different types from the fit

Created on 2022-06-12 by the reprex package (v2.0.1)
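
One conceivable direction, sketched here purely as an illustration and not as what insight does, is to re-wrap the affected columns as one-column matrices before calling predict(), using the types recorded in dataClasses. It assumes an lm-like model with a terms() method and uses a smaller model than the reprex.

```r
d <- mtcars
d$wt <- scale(d$wt)
m <- lm(mpg ~ wt + cyl, data = d)

# New data with 'wt' coerced to a plain vector, similar to get_data() output
nd <- data.frame(wt = as.numeric(d$wt), cyl = d$cyl)
err <- tryCatch(predict(m, newdata = nd), error = function(e) e)
inherits(err, "error")    # the type check in predict() rejects 'nd'

# Re-wrap the variables that were matrices at fit time
data_classes <- attr(stats::terms(m), "dataClasses")
for (nm in intersect(names(nd), names(data_classes))) {
  if (grepl("^nmatrix", data_classes[[nm]])) {
    nd[[nm]] <- as.matrix(nd[[nm]])
  }
}

pred <- predict(m, newdata = nd)
all.equal(unname(pred), unname(fitted(m)))
```

This relies on dataClasses being available, so it would not help for model types without a terms() method.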

Jun 12 '22 21:06 strengejacke
